Archive for the ‘Tweets’ Category

Tweet NLP

Tuesday, October 21st, 2014

Tweet NLP (Carnegie Mellon)

From the webpage:

We provide a tokenizer, a part-of-speech tagger, hierarchical word clusters, and a dependency parser for tweets, along with annotated corpora and web-based annotation tools.

See the website for further details.

I can understand vendors mining tweets and trying to react to every twitch in some social stream, but the U.S. military is interested as well.

“Customer targeting” in their case has a whole different meaning.

Assuming you can identify one or more classes of tweets, would it be possible to mimic those patterns, albeit with some deviation in the content of the tweets? That is, what if some tweet content is weighted more heavily than other tweet content?

I first saw this in a tweet by Peter Skomoroch.

Twitter Mapping: Foundations

Sunday, October 12th, 2014

Twitter Mapping: Foundations by Simon Rogers.

From the post:

With more than 500 million tweets sent every day, Twitter data as a whole can seem huge and unimaginable, like cramming the contents of the Library of Congress into your living room.

One way of trying to make that big data understandable is by making it smaller and easier to handle by giving it context; by putting it on a map.

It’s something I do a lot—I’ve published over 1,000 maps in the past five years, mostly at Guardian Data. At Twitter, with 77% of users outside the US, it’s often aimed at seeing if regional variations can give us a global picture, an insight into the way a story spreads around the globe. Here’s what I’ve learned about using Twitter data on maps.

… (lots of really cool maps and links omitted)

Creating data visualizations is simpler now than it’s ever been, with a plethora of tools (free and paid) meaning that any journalist working in any newsroom can make a chart or a map in a matter of minutes. Because of time constraints, we often use CartoDB to animate maps of tweets over time. The process is straightforward—I’ve written a how-to guide on my blog that shows how to create an animated map of dots using the basic interface, and if the data is not too big it won’t cost you anything. CartoDB is also handy for other reasons: as it has access to Twitter data, you can use it to get the geotagged tweets too. And it’s not the only one: Trendsmap is a great way to see location of conversations over time.

Have you made a map with Twitter Data that tells a compelling story? Share it with us via @TwitterData.

While composing this post I looked at the CartoDB solution for geotagged tweets and, while impressive, it is currently in beta with a starting price of $300/month. That works if you get your expenses paid but is a bit pricey for occasional use.

There is a free option for CartoDB (up to 50 MB of data) but I don’t think it includes the Twitter capabilities.

Sample mapping tweets on your favorite issues. Maps are persuasive in ways that are not completely understood.

Twitter and the Arab Spring

Sunday, September 7th, 2014

You may remember that “effective use of social media” was claimed as a hallmark of the Arab Spring. (The Arab Spring and the impact of social media and Opening Closed Regimes: What Was the Role of Social Media During the Arab Spring?)

When evaluating such claims remember that your experience with social media may or may not represent the experience with social media elsewhere.

For example, Citizen Engagement and Public Services in the Arab World: The Potential of Social Media from Mohammed Bin Rashid School of Government (2014) reports:

Figure 23: Egypt 22.4% Facebook User Penetration

Figure 34: Egypt 1.26% Twitter user penetration rate.

Those figures are as of 2014. Figures for prior years are smaller.

That doesn’t sound like a level of social media penetration necessary to create and then drive a social movement like the Arab Spring.

You can find additional datasets and additional information at: Registration is free.

And check out: Mohammed Bin Rashid School of Government

I first saw this in a tweet by Peter W. Singer.

Summingbird:… [VLDB 2014]

Monday, August 4th, 2014

Summingbird: A Framework for Integrating Batch and Online MapReduce Computations by Oscar Boykin, Sam Ritchie, Ian O’Connell, and Jimmy Lin.


Summingbird is an open-source domain-specific language implemented in Scala and designed to integrate online and batch MapReduce computations in a single framework. Summingbird programs are written using data flow abstractions such as sources, sinks, and stores, and can run on different execution platforms: Hadoop for batch processing (via Scalding/Cascading) and Storm for online processing. Different execution modes require different bindings for the data flow abstractions (e.g., HDFS files or message queues for the source) but do not require any changes to the program logic. Furthermore, Summingbird can operate in a hybrid processing mode that transparently integrates batch and online results to efficiently generate up-to-date aggregations over long time spans. The language was designed to improve developer productivity and address pain points in building analytics solutions at Twitter where often, the same code needs to be written twice (once for batch processing and again for online processing) and indefinitely maintained in parallel. Our key insight is that certain algebraic structures provide the theoretical foundation for integrating batch and online processing in a seamless fashion. This means that Summingbird imposes constraints on the types of aggregations that can be performed, although in practice we have not found these constraints to be overly restrictive for a broad range of analytics tasks at Twitter.

Heavy sledding but deeply interesting work. Particularly about “…integrating batch and online processing in a seamless fashion.”
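The algebraic idea behind that seamless integration is easy to illustrate outside of Scala. Below is a minimal Python sketch (not Summingbird's actual API) of why a monoid, that is, an associative merge operation with an identity, lets batch and online partial results combine safely:

```python
# Sketch of the algebraic idea behind Summingbird's hybrid mode
# (illustrative Python, not Summingbird's actual Scala API).
# Because integer addition is associative with identity 0 (a monoid),
# partial sums computed by a batch job and an online job can be
# merged without re-reading either input.

def merge_counts(batch, online):
    """Merge two partial word-count dicts; valid for any monoid."""
    merged = dict(batch)
    for key, value in online.items():
        merged[key] = merged.get(key, 0) + value
    return merged

batch_counts = {"storm": 10, "hadoop": 7}     # from the batch (Hadoop) path
online_counts = {"storm": 3, "scalding": 2}   # from the online (Storm) path

total = merge_counts(batch_counts, online_counts)
print(total)  # {'storm': 13, 'hadoop': 7, 'scalding': 2}
```

The same merge works whether the partial counts came from HDFS or a message queue, which is exactly the constraint-for-flexibility trade the abstract describes.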

I first saw this in a tweet by Jimmy Lin.

Testing LDA

Wednesday, July 2nd, 2014

Using Latent Dirichlet Allocation to Categorize My Twitter Feed by Joseph Misiti.

From the post:

Over the past 3 years, I have tweeted about 4100 times, mostly URLS, and mostly about machine learning, statistics, big data, etc. I spent some time this past weekend seeing if I could categorize the tweets using Latent Dirichlet Allocation. For a great introduction to Latent Dirichlet Allocation (LDA), you can read the following link here. For the more mathematically inclined, you can read through this excellent paper which explains LDA in a lot more detail.

The first step to categorizing my tweets was pulling the data. I initially downloaded and installed Twython and tried to pull all of my tweets using the Twitter API, but then quickly realized there was an archive button under settings. So I stopped writing code and just double clicked the archive button. Apparently 4100 tweets is fairly easy to archive, because I received an email from Twitter within 15 seconds with a download link.

When you read Joseph’s post, note that he doesn’t use the content of his tweets but rather the content of the URLs he tweeted as the subject of the LDA analysis.

Still a valid corpus for LDA analysis but I would not characterize it as “categorizing” his tweet feed, meaning the tweets, but rather “categorizing” the content he tweeted about. Not the same thing.

A useful exercise because it uses LDA on a corpus with which you should be familiar, the materials you tweeted about.

As opposed to using LDA on a corpus that is less well known to you and you are reduced to running sanity checks with no real feel for the data.

It would be an interesting exercise to discover the top topics for the corpus you tweeted about (Joseph’s post) and also for the corpus of #tags you used in your tweets. Are they the same or different?
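Building that hashtag corpus can start with a simple extraction pass before any LDA is run. A stdlib-only sketch (the regex is a simplification of Twitter's real hashtag rules):

```python
import re
from collections import Counter

# Build a hashtag "corpus" from raw tweet text, as a starting point
# for the comparison suggested above. The pattern is a simplification:
# real hashtag rules (Unicode scripts, leading digits) are more subtle.
HASHTAG = re.compile(r"#(\w+)")

def hashtags(tweets):
    """Return hashtag frequencies across a list of tweet strings."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts

sample = [
    "Reading up on #LDA and #MachineLearning",
    "New post on #lda topic models",
]
print(hashtags(sample))  # Counter({'lda': 2, 'machinelearning': 1})
```

Each tweet's hashtag multiset then becomes a "document" for the topic model, in the same way Joseph used the text of the linked URLs.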

I first saw this in a tweet by Christophe Lalanne.

Twitter and Refusing Service

Tuesday, June 17th, 2014

Twitter struggles to remain the free-speech wing of the free-speech party as it suspends terrorist accounts by Mathew Ingram.

Mathew’s headline must be one of those “click-bait” things I keep hearing about.

When I followed the link, I was expecting to find that Twitter had suspended the Twitter accounts of Oliver North, Donald Rumsfeld, etc.

No such luck.

What did happen was:

Twitter recently suspended the account belonging to the Islamic State in Iraq and Syria (ISIS) after the group — which claims to represent radical Sunni militants — posted photographs of its activities, including what appeared to be a mass execution in Iraq. The service has also suspended other accounts related to the group for what seem to be similar reasons, including one that live-tweeted the group’s advance into the city of Mosul.

“Terrorism” and “terrorist” depend upon your current side. As I understand recent news, Iran is about to become a United States ally in the Middle East instead of a component of the axis of evil (as per George W. Bush). Amazing the difference that only twelve (12) years make.

A new service motto for Twitter:

Twitter reserves the right to refuse service to anyone at any time.

I know who that motto serves.

Do you?

Ethics and Big Data

Monday, May 26th, 2014

Ethical research standards in a world of big data by Caitlin M. Rivers and Bryan L. Lewis.


In 2009 Ginsberg et al. reported using Google search query volume to estimate influenza activity in advance of traditional methodologies. It was a groundbreaking example of digital disease detection, and it still remains illustrative of the power of gathering data from the internet for important research. In recent years, the methodologies have been extended to include new topics and data sources; Twitter in particular has been used for surveillance of influenza-like-illnesses, political sentiments, and even behavioral risk factors like sentiments about childhood vaccination programs. As the research landscape continuously changes, the protection of human subjects in online research needs to keep pace. Here we propose a number of guidelines for ensuring that the work done by digital researchers is supported by ethical-use principles. Our proposed guidelines include: 1) Study designs using Twitter-derived data should be transparent and readily available to the public. 2) The context in which a tweet is sent should be respected by researchers. 3) All data that could be used to identify tweet authors, including geolocations, should be secured. 4) No information collected from Twitter should be used to procure more data about tweet authors from other sources. 5) Study designs that require data collection from a few individuals rather than aggregate analysis require Institutional Review Board (IRB) approval. 6) Researchers should adhere to a user’s attempt to control his or her data by respecting privacy settings. As researchers, we believe that a discourse within the research community is needed to ensure protection of research subjects. These guidelines are offered to help start this discourse and to lay the foundations for the ethical use of Twitter data.

I am curious: who is going to follow this suggested code of ethics?

Without long consideration, obviously not the NSA, FBI, CIA, DoD, or any employee of the United States government.

Ditto for the security services in any country plus their governments.

Industry players are well known for their near perfect recidivism rate on corporate crime so not expecting big data ethics there.

Drug cartels? Anyone shipping cocaine in multi-kilogram lots is unlikely to be interested in Big Data ethics.

That rather narrows the pool of prospective users of a code of ethics for big data doesn’t it?

I first saw this in a tweet by Ed Yong.

Create Dataset of Users from the Twitter API

Saturday, May 17th, 2014

Create Dataset of Users from the Twitter API by Ryan Swanson.

From the post:

This project provides an example of using python to pull user data from Twitter.

This project will create a dataset of the top 1000 twitter users for any given search topic.

As written, the project returns these values:

  1. handle – twitter username | string
  2. name – full name of the twitter user | string
  3. age – number of days the user has existed on twitter | number
  4. numOfTweets – number of tweets this user has created (includes retweets) | number
  5. hasProfile – 1 if the user has created a profile description, 0 otherwise | boolean
  6. hasPic – 1 if the user has setup a profile pic, 0 otherwise | boolean
  7. numFollowing – number of other twitter users, this user is following | number
  8. numOfFavorites – number of tweets the user has favorited | number
  9. numOfLists – number of public lists this user has been added to | number
  10. numOfFollowers – number of other users following this user | number

You need to read the Twitter documentation if you want to extend this project to capture other values. Such as a list of followers or who someone is following, important for sketching communities for example. Or tracing tweets/retweets across a community.
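Several of the fields above are derived rather than returned directly. The "age" field, for example, is just days elapsed since the account's created_at timestamp. A sketch (the timestamp format matches what Twitter's API returned at the time, but verify against the current documentation):

```python
from datetime import datetime, timezone

# Derive the "age" field (days on Twitter) from the created_at string
# the API returns for a user, e.g. "Fri May 17 10:00:00 +0000 2013".
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def account_age_days(created_at, now=None):
    """Days between an account's creation and `now` (default: today)."""
    created = datetime.strptime(created_at, TWITTER_FORMAT)
    now = now or datetime.now(timezone.utc)
    return (now - created).days

reference = datetime(2014, 5, 17, tzinfo=timezone.utc)
print(account_age_days("Fri May 17 10:00:00 +0000 2013", now=reference))  # 364
```

The boolean-style fields (hasProfile, hasPic) are similar one-liners over the user object: test whether the description or profile image field is non-empty.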


Twitter User Targeting Data

Sunday, May 11th, 2014

Geotagging One Hundred Million Twitter Accounts with Total Variation Minimization by Ryan Compton, David Jurgens, and, David Allen.


Geographically annotated social media is extremely valuable for modern information retrieval. However, when researchers can only access publicly-visible data, one quickly finds that social media users rarely publish location information. In this work, we provide a method which can geolocate the overwhelming majority of active Twitter users, independent of their location sharing preferences, using only publicly-visible Twitter data.

Our method infers an unknown user’s location by examining their friends’ locations. We frame the geotagging problem as an optimization over a social network with a total variation-based objective and provide a scalable and distributed algorithm for its solution. Furthermore, we show how a robust estimate of the geographic dispersion of each user’s ego network can be used as a per-user accuracy measure, allowing us to discard poor location inferences and control the overall error of our approach.

Leave-many-out evaluation shows that our method is able to infer location for 101,846,236 Twitter users at a median error of 6.33 km, allowing us to geotag roughly 89% of public tweets.

If 6.33 km sounds like a lot of error, check out NUKEMAP by Alex Wellerstein.
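To get a feel for the friend-based inference, here is a deliberately crude one-user caricature: place the unknown user at the friend location that minimizes total distance to the other friends (a medoid). The paper's actual method optimizes a total-variation objective over the entire network at once; this sketch only captures the intuition.

```python
import math

# Toy caricature of friend-based geotagging: put an unknown user at
# the friend location minimizing total distance to all friends
# (a medoid). The real method optimizes a total-variation objective
# over the whole social network; this is a per-user simplification.

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def infer_location(friend_locations):
    """Medoid of friends' (lat, lon) coordinates."""
    return min(friend_locations,
               key=lambda p: sum(haversine_km(p, q) for q in friend_locations))

friends = [(40.7, -74.0), (40.8, -73.9), (34.0, -118.2)]  # two NYC, one LA
print(infer_location(friends))  # (40.7, -74.0): the NYC cluster wins
```

The spread of the friend locations around the chosen point plays the role of the paper's dispersion-based confidence measure: a tight cluster means a trustworthy inference, a continent-wide scatter does not.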

Scalding 0.9: Get it while it’s hot!

Thursday, April 10th, 2014

Scalding 0.9: Get it while it’s hot! by P. Oscar Boykin.

From the post:

It’s been just over two years since we open sourced Scalding and today we are very excited to release the 0.9 version. Scalding at Twitter powers everything from internal and external facing dashboards, to custom relevance and ad targeting algorithms, including many graph algorithms such as PageRank, approximate user cosine similarity and many more.

Oscar covers:

  • Joins
  • Input/output
    • Parquet Format
    • Avro
    • TemplateTap
  • Hadoop counters
  • Typed API
  • Matrix API

Or if you want something a bit more visual and just as enthusiastic, see:

Basically the same content but with Oscar live!

Building a tweet ranking web app using Neo4j

Wednesday, March 12th, 2014

Building a tweet ranking web app using Neo4j by William Lyon.

From the post:


I spent this past weekend hunkered down in the basement of the local Elk’s club, working on a project for a hackathon. The project was a tweet ranking web application. The idea was to build a web app that would allow users to login with their Twitter account and view a modified version of their Twitter timeline that shows them tweets ranked by importance. Spending hours every day scrolling through your timeline to keep up with what’s happening in your Twitter network? No more, with Twizzard!

The project uses the following components:

  • Node.js web application (using Express framework)
  • MongoDB database for storing basic user data
  • Integration with Twitter API, allowing for Twitter authentication
  • Python script for fetching Twitter data from Twitter API
  • Neo4j graph database for storing Twitter network data
  • Neo4j unmanaged server extension, providing additional REST endpoint for querying / retrieving ranked timelines per user

Looks like a great project and good practice as well!

Curious what you think of the ranking of tweets:

How can we score Tweets to show users their most important Tweets? Users are more likely to be interested in tweets from users they are more similar to and from users they interact with the most. We can calculate metrics to represent these relationships between users, adding an inverse time decay function to ensure that the content at the top of their timeline stays fresh.

That’s one measure of “importance.” Being able to assign a rank would be useful as well, say for the British Library.

Do take notice of the Jaccard similarity index.
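The Jaccard index is simple enough to compute over, say, the sets of accounts two users mention, and the quoted ranking also calls for an inverse time decay. A sketch of both pieces (the mention sets and half-life are made-up illustrations, not Twizzard's actual parameters):

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def score(similarity, age_hours, half_life=24.0):
    """Decay a similarity score so older tweets rank lower."""
    return similarity * math.exp(-math.log(2) * age_hours / half_life)

# Similarity between two users based on who they mention, one signal
# the quoted ranking combines with an inverse time decay on tweet age.
alice_mentions = {"@neo4j", "@nodejs", "@mongodb"}
bob_mentions = {"@neo4j", "@mongodb", "@python"}

print(jaccard(alice_mentions, bob_mentions))      # 0.5
print(round(score(0.5, 24.0), 2))                 # 0.25
```

Note the connection to topic maps: two sets sharing even one identifier already have a non-zero Jaccard score, so the index generalizes the "at least one identical string" test.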

Would you say that possessing at least one identical string (id, subject identifier, subject indicator) is a form of similarity measure?

What other types of similarity measures do you think would be useful for topic maps?

I first saw this in a tweet by GraphemeDB.

Mapping Twitter Topic Networks:…

Thursday, February 20th, 2014

Mapping Twitter Topic Networks: From Polarized Crowds to Community Clusters by Marc A. Smith, Lee Rainie, Ben Shneiderman and Itai Himelboim.

From the post:

Conversations on Twitter create networks with identifiable contours as people reply to and mention one another in their tweets. These conversational structures differ, depending on the subject and the people driving the conversation. Six structures are regularly observed: divided, unified, fragmented, clustered, and inward and outward hub and spoke structures. These are created as individuals choose whom to reply to or mention in their Twitter messages and the structures tell a story about the nature of the conversation.

If a topic is political, it is common to see two separate, polarized crowds take shape. They form two distinct discussion groups that mostly do not interact with each other. Frequently these are recognizably liberal or conservative groups. The participants within each separate group commonly mention very different collections of website URLs and use distinct hashtags and words. The split is clearly evident in many highly controversial discussions: people in clusters that we identified as liberal used URLs for mainstream news websites, while groups we identified as conservative used links to conservative news websites and commentary sources. At the center of each group are discussion leaders, the prominent people who are widely replied to or mentioned in the discussion. In polarized discussions, each group links to a different set of influential people or organizations that can be found at the center of each conversation cluster.

While these polarized crowds are common in political conversations on Twitter, it is important to remember that the people who take the time to post and talk about political issues on Twitter are a special group. Unlike many other Twitter members, they pay attention to issues, politicians, and political news, so their conversations are not representative of the views of the full Twitterverse. Moreover, Twitter users are only 18% of internet users and 14% of the overall adult population. Their demographic profile is not reflective of the full population. Additionally, other work by the Pew Research Center has shown that tweeters’ reactions to events are often at odds with overall public opinion— sometimes being more liberal, but not always. Finally, forthcoming survey findings from Pew Research will explore the relatively modest size of the social networking population who exchange political content in their network.

Great study on political networks but all the more interesting for introducing an element of sanity into discussions about Twitter.

At a minimum, Twitter having 18% of all Internet users and 14% of the overall adult population casts serious doubt on metrics using Twitter to rate software popularity. (“It’s all we have” is a pretty lame excuse for using bad metrics.)

Not to say it isn’t important to mine Twitter data for what content it holds but at the same time to remember Twitter isn’t the world.

I first saw this at Mapping Twitter Topic Networks: From Polarized Crowds to Community Clusters by FullTextReports.

Spring XD – Tweets – Hadoop – Sentiment Analysis

Saturday, February 15th, 2014

Using Spring XD to stream Tweets to Hadoop for Sentiment Analysis

From the webpage:

This tutorial will build on the previous tutorial – 13 – Refining and Visualizing Sentiment Data – by using Spring XD to stream in tweets to HDFS. Once in HDFS, we’ll use Apache Hive to process and analyze them, before visualizing in a tool.

I re-ordered the text:

This tutorial is from the Community part of the tutorial series for the Hortonworks Sandbox (1.3) – a single-node Hadoop cluster running in a virtual machine. Download it to run this and other tutorials in the series.

This community tutorial submitted by mehzer with source available at Github. Feel free to contribute edits or your own tutorial and help the community learn Hadoop.

not to take anything away from Spring XD or Sentiment Analysis but to emphasize the community tutorial aspects of the Hortonworks Sandbox.

At present count on tutorials:

Hortonworks: 14

Partners: 12

Community: 6

Thoughts on what the next community tutorial should be?

Twitter Keyboard Shortcuts

Thursday, February 13th, 2014

Twitter Keyboard Shortcuts by Gregory Piatetsky.

Too useful not to pass along.

Gregory says the best shortcut is “?”, which gives you all the keyboard shortcuts.

Pass it on.

Twitter Data Grants [Following 0 Followers 524,870 + 1]

Friday, February 7th, 2014

Introducing Twitter Data Grants by Raffi Krikorian.

Deadline: March 15, 2014

From the post:

Today we’re introducing a pilot project we’re calling Twitter Data Grants, through which we’ll give a handful of research institutions access to our public and historical data.

With more than 500 million Tweets a day, Twitter has an expansive set of data from which we can glean insights and learn about a variety of topics, from health-related information such as when and where the flu may hit to global events like ringing in the new year. To date, it has been challenging for researchers outside the company who are tackling big questions to collaborate with us to access our public, historical data. Our Data Grants program aims to change that by connecting research institutions and academics with the data they need.


If you’d like to participate, submit a proposal here no later than March 15th. For this initial pilot, we’ll select a small number of proposals to receive free datasets. We can do this thanks to Gnip, one of our certified data reseller partners. They are working with us to give selected institutions free and easy access to Twitter datasets. In addition to the data, we will also be offering opportunities for the selected institutions to collaborate with Twitter engineers and researchers.

We encourage those of you at research institutions using Twitter data to send in your best proposals. To get updates and stay in touch with the program: visit, make sure to follow @TwitterEng, or email with questions.

You may want to look at Twitter Engineering to see what has been of recent interest.

Tracking social media during the Arab Spring to separate journalists from participants could be interesting.

BTW, a factoid for today: @TwitterEng had 524,870 followers and 0 following when I first saw the grant page. Now they have 524,871 followers and 0 following. 😉

There’s another question: Who has the best following/follower ratio? Any patterns there?

I first saw this in a tweet by Gregory Piatetsky.

Visualize your Twitter followers…

Thursday, January 30th, 2014

Visualize your Twitter followers in 3 fairly easy — and totally free — steps by Derrick Harris.

From the post:

Twitter is a great service, but it’s not exactly easy for users without programming skills to access their account data, much less do anything with it. Until now.

There already are services that will let you download reports about when you tweet and which of your tweets were the most popular, some — like SimplyMeasured and FollowerWonk — will even summarize data about your followers. If you’re willing to wait hours to days (Twitter’s API rate limits are just that — limiting) and play around with open source software, NodeXL will help you build your own social graph. (I started and gave up after realizing how long it would take if you have more than a handful of followers.) But you never really see the raw data, so you have to trust the services and you have to hope they present the information you want to see.

Then, last week, someone from ScraperWiki tweeted at me, noting that the service can now gather raw data about users’ accounts. (I’ve used the service before to measure tweet activity.) I was intrigued. But I didn’t want to just see the data in a table, I wanted to do something more with it. Here’s what I did.

Another illustration that the technology expertise gap between users does not matter as much as the imagination gap between users.

The Google Fusion Table image is quite good.


Saturday, January 25th, 2014


From the about page:

Welcome to the Flax UKMP application, providing search and analysis of tweets posted by UK members of Parliament.

This application started life during a hackday organised by Flax for the Enterprise Search Meetup for Cambridge in the UK. During the day, the participants split into groups working on a number of different activities, two of them using a sample of Twitter data from members of the UK Parliament’s House of Commons. By the end of the day, both groups had small web applications available which used the data in slightly different ways. Both of those applications have been used in the construction of this site.

The content is gathered from four Twitter lists: one for each of the Conservative, Labour and Liberal Democrat parties, and a further list for the smaller parties. We extract the relevant data, and use the Stanford NLP software to extract entities (organisations, people and locations) mentioned in the tweet, and feed the tweets into a Solr search engine. The tweets are then made available to view, filter and search on the Browse page.

The source code is available in the Flax github repository. Do let us know what you think.

I don’t think programmers are in danger from projects like this one, primarily because they work with the “data” and don’t necessarily ingest a lot of it.

Readers and testers, on the other hand: I fear that in sufficient quantities, tweets from politicians could make the average reader dumber by the minute.

As a safety precaution, particularly in the United States, have multiple copies of Shakespeare and Dante about, just in case anyone’s mind seizes while reading such drivel.

Readers should also be cautioned to wait at least 10 to 15 minutes before they attempt to operate a motor vehicle. 😉

The Road to Summingbird:…

Sunday, January 12th, 2014

The Road to Summingbird: Stream Processing at (Every) Scale by Sam Ritchie.


Twitter’s Summingbird library allows developers and data scientists to build massive streaming MapReduce pipelines without worrying about the usual mess of systems issues that come with realtime systems at scale.

But what if your project is not quite at “scale” yet? Should you ignore scale until it becomes a problem, or swallow the pill ahead of time? Is using Summingbird overkill for small projects? I argue that it’s not. This talk will discuss the ideas and components of Summingbird that you could, and SHOULD, use in your startup’s code from day one. You’ll come away with a new appreciation for monoids and semigroups and a thirst for abstract algebra.

A slide deck that will make you regret missing the presentation.

I wasn’t able to find a video of Sam’s presentation at Data Day Texas 2014, but I did find a collection of his presentations, including some videos, at:

Valuable lessons for startups and others.

Twitter Weather Radar – Test Data for Language Analytics

Sunday, December 22nd, 2013

Twitter Weather Radar – Test Data for Language Analytics by Nicholas Hartman.

From the post:

Today we’d like to share with you some fun charts that have come out of our internal linguistics research efforts. Specifically, studying weather events by analyzing social media traffic from Twitter.

We do not specialize in social media and most of our data analytics work focuses on the internal operations of leading organizations. Why then would we bother playing around with Twitter data? In short, because it’s good practice. Twitter data mimics a lot of the challenges we face when analyzing the free text streams generated by complex processes. Specifically:

  • High Volume: The analysis represented here is looking at around 1 million tweets a day. In the grand scheme of things, that’s not a lot, but we’re intentionally running the analysis on a small server. That forces us to write code that rapidly assesses what’s relevant to the question we’re trying to answer and what’s not. In this case the raw tweets were quickly tested live on receipt, with about 90% of them discarded. The remaining 10% were passed onto the analytics code.
  • Messy Language: A lot of text analytics exercises I’ve seen published use books and news articles as their testing ground. That’s fine if you’re trying to write code to analyze books or news articles, but most of the world’s text is not written with such clean and polished prose. The types of text we encounter (e.g., worklogs from an IT incident management system) are full of slang, incomplete sentences and typos. Our language code needs to be good at determining the messages contained within this messy text.
  • Varying Signal to Noise: The incoming stream of tweets will always contain a certain percentage of data that isn’t relevant to the item we’re studying. For example, if a band member from One Direction tweets something even tangentially related to what some code is scanning for, the dataset can suddenly be overwhelmed with a lot of off-topic tweets. Real-world data similarly has a lot of unexpected noise.

In the exercise below, tweets from Twitter’s streaming API JSON stream were scanned in near real-time for their ability to 1) be pinpointed to a specific location and 2) provide potential details on local weather conditions. The vast majority of tweets passing through our code failed to meet both of these conditions. The tweets that remained were analyzed to determine the type of precipitation being discussed.
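The two-condition filter described (geotagged, plus mentions of precipitation) can be sketched over the streaming API's JSON like this. The field names match Twitter's streaming format of the era; the keyword list is our own invention:

```python
import json

# Sketch of the two-stage filter described above: keep only tweets
# that (1) carry coordinates and (2) mention precipitation. JSON
# fields follow Twitter's streaming API of the era; keyword list is ours.
PRECIPITATION = {"rain", "raining", "snow", "snowing", "sleet", "hail"}

def weather_tweets(lines):
    """Yield (coordinates, matched_terms) for qualifying tweets."""
    for line in lines:
        tweet = json.loads(line)
        coords = tweet.get("coordinates")          # condition 1: geotagged
        if not coords:
            continue
        words = set(tweet.get("text", "").lower().split())
        terms = words & PRECIPITATION              # condition 2: weather terms
        if terms:
            yield coords["coordinates"], terms

stream = [
    json.dumps({"text": "Ugh, snow again", "coordinates":
                {"type": "Point", "coordinates": [-71.06, 42.36]}}),
    json.dumps({"text": "Nice day out", "coordinates": None}),
]
print(list(weather_tweets(stream)))  # [([-71.06, 42.36], {'snow'})]
```

As the post notes, the vast majority of tweets fail one of the two tests, which is exactly what makes the filter cheap enough to run live on a small server.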

An interesting reminder that data to test your data mining/analytics is never far away.

If not Twitter, pick one of the numerous email archives or open data datasets.

The post doesn’t offer any substantial technical details but then you need to work those out for yourself.

How to spot first stories on Twitter using Storm

Wednesday, November 27th, 2013

How to spot first stories on Twitter using Storm by Michael Vogiatzis.

From the post:

As a first blog post, I decided to describe a way to detect first stories (a.k.a. new events) on Twitter as they happen. This work is part of the thesis I wrote last year for my MSc in Computer Science at the University of Edinburgh. You can find the document here.

Every day, thousands of posts share information about news, events, automatic updates (weather, songs) and personal information. The information published can be retrieved and analyzed in a news detection approach. The immediate spread of events on Twitter combined with the large number of Twitter users prove it suitable for first stories extraction. Towards this direction, this project deals with distributed real-time first story detection (FSD) using Twitter on top of Storm. Specifically, I try to identify the first document in a stream of documents which discusses a specific event. Let’s have a look into the implementation of the methods used.

Other resources of interest:

Slide deck by the same name.

Code on Github.

The slides were interesting and were what prompted me to search for and find the blog and Github materials.

An interesting extension to this technique would be to discover “new” ideas in papers.

Or particular classes of “new” ideas in news streams.


The Ultimate Who-To-Follow Guide…

Thursday, October 10th, 2013

The Ultimate Who-To-Follow Guide for Tweeting Librarians, Info Pros, and Educators by Ellyssa Kroski.

Lists thirty (30) librarian feeds, then thirty (30) tech feeds (bad list construction, the ones under publication continue the tech feed list), ten (10) feeds for book lovers and pointers to three other lists of feeds.

You may need several more Twitter accounts or a better reader than most of the ones I have seen. Rules, regexes and some ML would all be useful.
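The "rules, regexes and some ML" idea can be sketched as a simple rule router: regex rules sort incoming tweets into topic buckets, and anything unmatched is set aside for a classifier. The rule set below is invented for illustration.

```python
# Hypothetical regex-based tweet router. First matching rule wins; tweets
# matching no rule land in "unclassified", a natural input for ML later.
import re
from collections import defaultdict

RULES = {
    "libraries": re.compile(r"\b(librar\w+|archiv\w+|catalog\w+)\b", re.I),
    "tech":      re.compile(r"\b(python|hadoop|nosql|javascript)\b", re.I),
    "books":     re.compile(r"\b(novel|author|reading|bestseller)\b", re.I),
}

def route(tweets):
    buckets = defaultdict(list)
    for text in tweets:
        for topic, pattern in RULES.items():
            if pattern.search(text):
                buckets[topic].append(text)
                break
        else:
            buckets["unclassified"].append(text)
    return dict(buckets)

print(route(["New NoSQL release out", "Great novel this week", "hello world"]))
```

A reader built on something like this would spare you juggling several Twitter accounts just to keep feeds separated.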

Not to mention outputting the captured tweets into a topic map for navigation.

PS: I first saw this on Force11.

Referencing a Tweet…

Saturday, September 28th, 2013

Referencing a Tweet in an Academic Paper? Here’s an Automatic Citation Generator by Rebecca Rosen.

I haven’t ever thought of formally referencing a tweet but Rebecca details the required format (MLA and APA) plus points us to a free generator for tweet citations.

If I don’t make a note of this article, someone will ask me next week how to cite a tweet.

Or worse yet, someone in a standards body will decide that tweets are appropriate for normative references.

I kid you not. I ran across a citation in a 2013 draft to a ten-year-old version of the XML standard.

Not so bad if in the bibliography for a cattle data paper but this was in a markup vocabulary proposal.

Inside or outside a topic map, proper and accurate citation is a courtesy to your readers.

How to Refine and Visualize Twitter Data [Al-Qaeda Bots?]

Saturday, September 14th, 2013

How to Refine and Visualize Twitter Data by Cheryle Custer.

From the post:

He loves me, he loves me not… using daisies to figure out someone’s feelings is so last century. A much better way to determine whether someone likes you, your product or your company is to do some analysis on Twitter feeds to get better data on what the public is saying. But how do you take thousands of tweets and process them? We show you how in our video – Understand your customers’ sentiments with Social Media Data – that you can capture a Twitter stream to do Sentiment Analysis.

Now, when you boot up your Hortonworks Sandbox today, you’ll find Tutorial 13: Refining and Visualizing Sentiment Data as the companion step-by-step guide to the video. In this Hadoop tutorial, we will show you how you can take a Twitter stream and visualize it in Excel 2013, or you could use your own favorite visualization tool. Note you can use any version of Excel, but Excel 2013 allows you to plot the data on a map where other versions will limit you to the built-in charting function.

A great tutorial from Hortonworks as always!
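The tutorial does its scoring with Hive and a sentiment dictionary; the tiny stand-in below shows the same lexicon-scoring idea in plain Python. The word lists are invented for illustration and are nothing like a real sentiment dictionary.

```python
# Toy lexicon-based sentiment scoring: count positive and negative words
# and take the sign of the difference. Word lists are illustrative only.
POSITIVE = {"love", "great", "awesome", "good"}
NEGATIVE = {"hate", "awful", "terrible", "bad"}

def sentiment(text: str) -> str:
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print([sentiment(t) for t in
       ["I love this product", "terrible customer service", "just a tweet"]])
```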

My only reservation is the acceptance of Twitter data for sentiment analysis.

True, it is easy to obtain, not all that difficult to process, but that isn’t the same thing as having any connection with sentiment about a company or product.

Consider that a now somewhat dated report (2012) reported that 51% of all Internet traffic is “non-human.”

If that is the case or has worsened since then, how do you account for that in your sentiment analysis?

Or if you are monitoring the Internet for Al-Qaeda threats, how do you distinguish threats from Al-Qaeda bots from threats by Al-Qaeda members?

What if threat levels are being gamed by Al-Qaeda bot networks?

Forcing expenditure of resources on a global scale at a very small cost.

A new type of asymmetric warfare?
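One partial answer to the bot question: screen accounts for bot-like behavior before scoring their tweets. The heuristic below is a crude sketch with invented thresholds; real bot detection is far harder, which is rather the point.

```python
# Crude bot-screening heuristic: flag accounts that post at an implausible
# rate or mostly repeat the same text. Thresholds are invented for
# illustration, not taken from any published detector.
from collections import Counter

def looks_like_bot(tweets_per_day: float, texts: list) -> bool:
    if tweets_per_day > 200:                       # implausibly high volume
        return True
    counts = Counter(texts)
    most_common = counts.most_common(1)[0][1] if texts else 0
    return most_common / max(len(texts), 1) > 0.5  # mostly duplicate text

print(looks_like_bot(500, ["buy now"] * 3))
print(looks_like_bot(20, ["a", "b", "c", "d"]))
```

A determined bot network defeats rules this simple, of course, which is exactly why gamed sentiment and threat levels are a live concern.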

Twitter Data Analytics

Wednesday, September 11th, 2013

Twitter Data Analytics by Shamanth Kumar, Fred Morstatter, and Huan Liu.

From the webpage:

Social media has become a major platform for information sharing. Due to its openness in sharing data, Twitter is a prime example of social media in which researchers can verify their hypotheses, and practitioners can mine interesting patterns and build realworld applications. This book takes a reader through the process of harnessing Twitter data to find answers to intriguing questions. We begin with an introduction to the process of collecting data through Twitter’s APIs and proceed to discuss strategies for curating large datasets. We then guide the reader through the process of visualizing Twitter data with realworld examples, present challenges and complexities of building visual analytic tools, and provide strategies to address these issues. We show by example how some powerful measures can be computed using various Twitter data sources. This book is designed to provide researchers, practitioners, project managers, and graduate students new to the field with an entry point to jump start their endeavors. It also serves as a convenient reference for readers seasoned in Twitter data analysis.

Preprint with data set on analyzing Twitter data.

Although running a scant seventy-nine (79) pages, including an index, Twitter Data Analytics (TDA) covers the full arc from collecting data through Twitter’s APIs, to curating and storing large datasets, to measuring and visualizing them.

Each chapter ends with suggestions for further reading and references.

In addition to learning more about Twitter and its APIs, the reader will be introduced to MongoDB, JUNG and D3.

No mean accomplishment for seventy-nine (79) pages!

Summingbird [Twitter open sources]

Tuesday, September 3rd, 2013

Twitter open sources Storm-Hadoop hybrid called Summingbird by Derrick Harris.

I look away for a few hours to review a specification and look what pops up:

Twitter has open sourced a system that aims to mitigate the tradeoffs between batch processing and stream processing by combining them into a hybrid system. In the case of Twitter, Hadoop handles batch processing, Storm handles stream processing, and the hybrid system is called Summingbird. It’s not a tool for every job, but it sounds pretty handy for those it’s designed to address.

Twitter’s blog post announcing Summingbird is pretty technical, but the problem is pretty easy to understand if you think about how Twitter works. Services like Trending Topics and search require real-time processing of data to be useful, but they eventually need to be accurate and probably analyzed a little more thoroughly. Storm is like a hospital’s triage unit, while Hadoop is like longer-term patient care.

This description of Summingbird from the project’s wiki does a pretty good job of explaining how it works at a high level.
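The core abstraction behind Summingbird is that results live in a monoid: the batch (Hadoop) layer and the realtime (Storm) layer each produce partial aggregates that merge with a plain associative operation. Here is a toy Python illustration of that idea using word counts; it is not Summingbird's Scala API, just the algebra it rests on.

```python
# Summingbird's key idea in miniature: word counts form a monoid under
# addition, so batch-layer totals and realtime-layer totals merge with a
# single associative "+", regardless of how the data was partitioned.
from collections import Counter

def word_counts(docs):
    c = Counter()
    for doc in docs:
        c.update(doc.lower().split())
    return c

batch = word_counts(["storm and hadoop", "hadoop at scale"])   # historical data
realtime = word_counts(["storm right now"])                    # fresh stream

merged = batch + realtime    # monoid merge: grouping and order don't matter
print(merged["storm"], merged["hadoop"])
```

Because the merge is associative, the hybrid system can recompute the batch side at leisure for accuracy while the stream side stays current, then combine the two.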


While the Summingbird announcement is heavy sledding, it is well written. The projects spawned by Summingbird are rife with possibilities.

I appreciate Derrick’s comment:

It’s not a tool for every job, but it sounds pretty handy for those it’s designed to address.

I don’t know of any tools “for every job,” the opinions of some graph advocates notwithstanding. 😉

If Summingbird fits your problem set, spend some serious time seeing what it has to offer.

NoSQL Listener

Wednesday, August 28th, 2013

NoSQL Listener

From the webpage:

Aggregating NoSQL news from Twitter, from your friends at Cloudant

What twitter streams do you want to capture and post online (or process into a topic map)?

You can fork this project at GitHub.

Here’s a research idea:

Capture tweets on a possible U.S. lead conflict and separate out those from a geographic plot around the Pentagon.

Do the tweet levels or tone track U.S. military action?
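The geographic-separation step in that research idea amounts to a bounding-box filter on geotagged tweets. A minimal sketch, with the box size chosen arbitrarily for illustration (the Pentagon sits at roughly 38.871 N, 77.056 W):

```python
# Keep tweets whose coordinates fall inside a rough bounding box around the
# Pentagon. The half-degree window is an arbitrary illustrative choice.
def in_pentagon_box(lat, lon, half_deg=0.05):
    return abs(lat - 38.871) <= half_deg and abs(lon - (-77.056)) <= half_deg

tweets = [
    {"text": "traffic on 395", "lat": 38.87, "lon": -77.05},
    {"text": "hello from Tokyo", "lat": 35.68, "lon": 139.69},
]
near = [t["text"] for t in tweets if in_pentagon_box(t["lat"], t["lon"])]
print(near)
```

Comparing tweet volume or tone inside the box against the general stream is then an ordinary time-series question.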

BirdWatch v0.2…

Friday, August 16th, 2013

BirdWatch v0.2: Tweet Stream Analysis with AngularJS, ElasticSearch and Play Framework by Matthias Nehlsen.

From the post:

I am happy to get a huge update of the BirdWatch application out of the way. The changes are much more than what I would normally want to work on for a single article, but then again there is enough interesting stuff going on in this new version for multiple blog articles to come. Initially this application was only meant to be an exercise in streaming information to web clients. But in the meantime I have noticed that this application can be useful and interesting beyond being a mere learning exercise. Let me explain what it has evolved to:

BirdWatch is an open-source real-time tweet search engine for a defined area of interest. I am running a public instance of this for software engineering related tweets. The application subscribes to all Tweets containing at least one out of a set of terms (such as AngularJS, Java, JavaScript, MongoDB, Python, Scala, …). The application receives all those tweets immediately through the Twitter Streaming API. The limitation here is that the delivery is capped to one percent of all Tweets. This is plenty for a well defined area of interest, considering that Twitter processes more than 400 million tweets per day.

Just watching the public feed is amusing.

As Matthias says, there is a lot more that could be done with the incoming feed.

For some well defined area, you could be streaming the latest tweets on particular subjects or even who to follow, after you have harvested enough tweets.

See the project at GitHub.

Twitter4j and Scala

Monday, July 29th, 2013

Using twitter4j with Scala to access streaming tweets by Jason Baldridge.

From the introduction:

My previous post provided a walk-through for using the Twitter streaming API from the command line, but tweets can be more flexibly obtained and processed using an API for accessing Twitter in your programming language of choice. In this tutorial, I walk through basic setup and some simple uses of the twitter4j library with Scala. Much of what I show here should be useful for those using other JVM languages like Clojure and Java. If you haven’t gone through the previous tutorial, have a look now before going on, as this tutorial covers much of the same material but using twitter4j rather than HTTP requests.

I’ll introduce code, bit by bit, for accessing the Twitter data in different ways. If you get lost with what should go where, all of the code necessary to run the commands is available in this github gist, so you can compare to that as you move through the tutorial.

Update: The tutorial is set up to take you from nothing to being able to obtain tweets in various ways, but you can also get all the relevant code by looking at the twitter4j-tutorial repository. For this tutorial, the tag is v0.1.0, and you can also download a tarball of that version.

Using Twitter4j with Scala to perform user actions by Jason Baldridge.

From the introduction:

My previous post showed how to use Twitter4j in Scala to access Twitter streams. This post shows how to control a Twitter user’s actions using Twitter4j. The primary purpose of this functionality is perhaps to create interfaces for Twitter like TweetDeck, but it can also be used to create bots that take automated actions on Twitter (one bot I’m playing around with is @tshrdlu, using the code in this tutorial and the code in the tshrdlu repository).

This post will only cover a small portion of the things you can do, but they are some of the more common things and I include a couple of simple but interesting use cases. Once you have these things in place, it is straightforward to figure out how to use the Twitter4j API docs (and Stack Overflow) to do the rest.

Jason continues his tutorial on accessing/processing Twitter streams using Twitter4j and Scala.

Since Twitter has enough status for royal baby names, your data should feel no shame being on Twitter. 😉

Not to mention that tweeted IRIs can point readers to content well in excess of one hundred and forty (140) characters.

AAAI – Weblogs and Social Media

Tuesday, July 9th, 2013

Seventh International AAAI Conference on Weblogs and Social Media

Abstracts and papers from the Seventh International AAAI Conference on Weblogs and Social Media.

Much to consider:

Frontmatter: Six (6) entries.

Full Papers: Sixty-nine (69) entries.

Poster Papers: Eighteen (18) entries.

Demonstration Papers: Five (5) entries.

Computational Personality Recognition: Ten (10) entries.

Social Computing for Workforce 2.0: Seven (7) entries.

Social Media Visualization: Four (4) entries.

When the City Meets the Citizen: Nine (9) entries.

Be aware that the links for tutorials and workshops only give you the abstracts describing the tutorials and workshops.

There is the obligatory “blind men and the elephant” paper:

Blind Men and the Elephant: Detecting Evolving Groups in Social News


We propose an automated and unsupervised methodology for a novel summarization of group behavior based on content preference. We show that graph theoretical community evolution (based on similarity of user preference for content) is effective in indexing these dynamics. Combined with text analysis that targets automatically-identified representative content for each community, our method produces a novel multi-layered representation of evolving group behavior. We demonstrate this methodology in the context of political discourse on a social news site with data that spans more than four years and find coexisting political leanings over extended periods and a disruptive external event that led to a significant reorganization of existing patterns. Finally, where there exists no ground truth, we propose a new evaluation approach by using entropy measures as evidence of coherence along the evolution path of these groups. This methodology is valuable to designers and managers of online forums in need of granular analytics of user activity, as well as to researchers in social and political sciences who wish to extend their inquiries to large-scale data available on the web.

It is a great paper but commits a common error when it notes:

Like the parable of Blind Men and the Elephant2, these techniques provide us with disjoint, specific pieces of information.

Yes, the parable is oft told to make a point about partial knowledge, but the careful observer will ask:

How are we different from the blind men trying to determine the nature of an elephant?

Aren’t we also blind men trying to determine the nature of blind men who are examining an elephant?

And so on?

Not that being blind men should keep us from having opinions, but it should make us wary of how deeply we are attached to them.

Not only are there elephants all the way down, there are blind men before, with (including ourselves) and around us.

Twitter visualizes billions of tweets…

Saturday, June 29th, 2013

Twitter visualizes billions of tweets in artful, interactive 3D maps by Nathan Olivarez-Giles.

From the post:

On June 1st, Twitter created beautiful maps visualizing billions of geotagged tweets. Today, the social network is getting artsy once again, using the same dataset — which it calls Billion Strokes — to produce interactive elevation maps that render geotagged tweets in 3D. This time around, Twitter visualized geotagged tweets from San Francisco, New York, and Istanbul in maps that viewers can manipulate.

For each city map, Twitter gives users the option of adding eight different layers over the topography. Users can also change the size of the elevation differences mapped out, to get a better idea of where most tweets are sent from. The maps can be seen from either an overhead view, or on a horizontal plane. The resulting maps look like harsh mountain ranges, but the peaks and valleys aren’t representative of the land — rather, a peak illustrates a high amount of tweets being sent from that location, while a trough displays an area where fewer tweets are sent. The whole thing was put together by Nicolas Belmonte, Twitter’s in-house data visualization scientist. You can check out the interactive maps on Twitter’s GitHub page.

I thought the contour view was the most interesting.

A visualization that shows tweet frequency by business address would be interesting as well.
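The "peaks" in these elevation maps are, at bottom, tweet counts per grid cell. A minimal sketch of that binning step, with an arbitrary grid resolution (the real maps use far more sophisticated rendering):

```python
# Bin geotagged tweets into a lat/lon grid; the per-cell counts become the
# "elevation" in a map like Twitter's. Cell size is an arbitrary choice here.
from collections import Counter

def density_grid(points, cell=0.01):
    grid = Counter()
    for lat, lon in points:
        key = (round(lat / cell), round(lon / cell))   # snap to a grid cell
        grid[key] += 1
    return grid

points = [(40.7128, -74.0060), (40.7130, -74.0058),    # two tweets near each other
          (37.7749, -122.4194)]                        # one tweet far away
grid = density_grid(points)
print(max(grid.values()))    # height of the tallest "peak"
```

Swap the lat/lon pairs for coordinates snapped to business addresses and you have the movie-theaters-versus-churches comparison.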

Are more tweets sent from movie theaters or churches?