Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 3, 2012

Awesome website for #rstats Mining Twitter using R

Filed under: Data Mining,Graphics,R,Tweets,Visualization — Patrick Durusau @ 7:33 pm

Awesome website for #rstats Mining Twitter using R by Ajay Ohri

From the post:

Just came across this very awesome website.

Did you know there were six kinds of wordclouds in R?

(giggles like a little boy)

https://sites.google.com/site/miningtwitter/questions/talking-about

No, I can honestly say I was unaware “…there were six kinds of wordclouds in R.” 😉

Still, it might be a useful thing to know at some point in the future.
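If you want to try a wordcloud before clicking through, here is a minimal sketch in Python. The linked site's examples are in R; the Python wordcloud package below is a stand-in, not the site's code:

```python
# Minimal word cloud from raw tweet text. A stand-in for the R
# examples on the linked site, using the Python "wordcloud" package.
from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud  # pip install wordcloud

tweets = [
    "Mining Twitter with #rstats is fun",
    "Six kinds of wordclouds in R, apparently",
]

# Count words across all tweets, dropping hashtags, mentions, and links.
words = Counter(
    w.lower()
    for t in tweets
    for w in t.split()
    if not w.startswith(("#", "@", "http"))
)

cloud = WordCloud(width=640, height=480).generate_from_frequencies(words)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```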

May 31, 2012

From Tweets to Results: How to obtain, mine, and analyze Twitter data

Filed under: Data Source,Data Streams,Tweets — Patrick Durusau @ 4:16 pm

From Tweets to Results: How to obtain, mine, and analyze Twitter data by Derek Ruths (McGill University).

Description:

Since Twitter’s creation in 2006, it has become one of the most popular microblogging platforms in the world. By virtue of its popularity, the relative structural simplicity of Twitter posts, and a tendency towards relaxed privacy settings, Twitter has also become a popular data source for research on a range of topics in sociology, psychology, political science, and anthropology. Nonetheless, despite its widespread use in the research community, there are many pitfalls when working with Twitter data.

In this day-long workshop, we will lead participants through the entire Twitter-based research pipeline: from obtaining Twitter data all the way through performing some of the sophisticated analyses that have been featured in recent high-profile publications. In the morning, we will cover the nuts and bolts of obtaining and working with a Twitter dataset including: using the Twitter API, the firehose, and rate limits; strategies for storing and filtering Twitter data; and how to publish your dataset for other researchers to use. In the afternoon, we will delve into techniques for analyzing Twitter content including constructing retweet, mention, and follower networks; measuring the sentiment of tweets; and inferring the gender of users from their profiles and unstructured text.

We assume that participants will have little to no prior experience with mining Twitter or other social network datasets. As the workshop will be interactive, participants are encouraged to bring a laptop. Code examples and exercises will be given in Python, thus participants should have some familiarity with the language. However, all concepts and techniques covered will be language-independent, so any individual with some background in scripting or programming will benefit from the workshop.
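To make the "mention network" portion of the afternoon concrete, here is a minimal sketch with networkx. The field names and tweets are illustrative, not workshop code:

```python
# Build a directed mention network from tweets: one weighted edge per
# mentioning user -> mentioned user. A sketch of the idea only.
import re

import networkx as nx  # pip install networkx

tweets = [
    {"user": "alice", "text": "@bob have you seen the new corpus?"},
    {"user": "bob", "text": "@alice @carol yes, downloading it now"},
]

G = nx.DiGraph()
for tweet in tweets:
    for mention in re.findall(r"@(\w+)", tweet["text"]):
        # Accumulate a weight so repeated mentions strengthen the edge.
        weight = G.get_edge_data(tweet["user"], mention, {}).get("weight", 0)
        G.add_edge(tweet["user"], mention, weight=weight + 1)

print(G.edges(data=True))
```

The same loop with a retweet pattern ("RT @user") gives the retweet network.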

Any plans to use Twitter feeds for your topic maps?

I first saw a reference to this workshop at: Do you haz teh (twitter) codez? by Ricard Nielson.

May 25, 2012

Build your own twitter like real time analytics – a step by step guide

Filed under: Analytics,Cassandra,Tweets — Patrick Durusau @ 4:22 am

Build your own twitter like real time analytics – a step by step guide

Where else but High Scalability would you find a “how-to” article like this one? Complete with guide and source code.

Good DIY project for the weekend.

Major social networking platforms like Facebook and Twitter have developed their own architectures for handling the need for real-time analytics on huge amounts of data. However, not every company has the need or resources to build their own Twitter-like solution.

In this example we have taken the same Twitter/Facebook-like blueprint, and made it simple enough for developers to implement. We have taken the following approach in our implementation:

  1. Use an in-memory data grid (XAP) for handling the real-time stream data processing.
  2. Use a Big Data database (Cassandra) for storing the historical data and managing the trend analytics.
  3. Use Cloudify (cloudifysource.org) for managing and automating the deployment on private or public clouds.

The example demonstrates a simple case of word count analytics. It uses Spring Social to plug in to real Twitter feeds. The solution is designed to cope efficiently with getting and processing the large volume of tweets. First, we partition the tweets so that we can process them in parallel, but we have to decide how to partition them efficiently. Partitioning by user might not be sufficiently balanced, so we decided to partition by the tweet ID, which we assume to be globally unique.

Then we need to persist and process the data with low latency, and for this we store the tweets in memory.
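The XAP/Cassandra/Cloudify stack aside, the partitioning decision itself fits in a few lines. A toy sketch (not the article's code) of hashing tweet IDs into partitions for parallel word counts:

```python
# Toy illustration of the article's partitioning choice: route each
# tweet to a partition by its (assumed globally unique) ID, then run
# word counts independently per partition.
from collections import Counter

NUM_PARTITIONS = 4

def partition_for(tweet_id: int) -> int:
    # Partitioning by tweet ID balances load better than by user:
    # a few prolific users would skew user-based partitions.
    return tweet_id % NUM_PARTITIONS

partitions = [Counter() for _ in range(NUM_PARTITIONS)]

tweets = [(101, "big data is big"), (102, "data about data")]
for tweet_id, text in tweets:
    partitions[partition_for(tweet_id)].update(text.lower().split())

# Merge per-partition counts for the global (trend) totals.
total = sum(partitions, Counter())
print(total.most_common(3))
```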

Automated harvesting of tweets has real potential, even with clear text transmission. Or perhaps because of it.

May 4, 2012

Processing & Twitter

Filed under: Graphics,Processing,Tweets,Visualization — Patrick Durusau @ 3:41 pm

Processing & Twitter

Jer writes:

Since I first released this tutorial in 2009, it has received thousands of views and has hopefully helped some of you get started with building projects incorporating Twitter with Processing. In late 2010, Twitter changed the way that authorization works, so I’ve updated the tutorial to bring it in line with the new Twitter API functionality.

Accessing information from the Twitter API with Processing is (reasonably) easy. A few people have sent me e-mails asking how it all works, so I thought I’d write a very quick tutorial to get everyone up on their feet.

We don’t need to know too much about how the Twitter API functions, because someone has put together a very useful Java library to do all of the dirty work for us. It’s called twitter4j, and you can download it here. We’ll be using this in the first step of the building section of this tutorial.
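Jer's tutorial sticks with Java and twitter4j. If you would rather test the waters from Python, a rough analogue with tweepy might look like the sketch below; tweepy's API has shifted between versions, so treat the calls as indicative:

```python
# Rough Python analogue of the tutorial's first step: authenticate
# and pull recent tweets. tweepy stands in for twitter4j here; this
# is not the tutorial's code.
import tweepy  # pip install tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

for status in api.user_timeline(screen_name="some_user", count=10):
    print(status.text)
```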

A nice introduction to Twitter (an information stream) and Processing (a visualization language).

Both of which may find their way into your topic maps.

April 3, 2012

Tracking Video Game Buzz

Filed under: Blogs,Clustering,Data Mining,Tweets — Patrick Durusau @ 4:17 pm

Tracking Video Game Buzz

Matthew Hurst writes:

Briefly, I pushed out an experimental version of track // games to track topics in the blogosphere relating to video games. As with track // microsoft, it gathers posts from blogs, clusters them, and uses an attention metric based on Bitly and Twitter to rank the clusters, new posts, and videos.

Currently at the top of the stack is Bungie Waves Goodbye To Halo.

Wonder if Matthew could be persuaded to do the same for the elections this Fall in the United States? 😉

March 8, 2012

Twitter Current English Lexicon

Filed under: Dataset,Lexicon,Tweets — Patrick Durusau @ 8:50 pm

Twitter Current English Lexicon

From the description:

Twitter Current English Lexicon: Based on the Twitter Stratified Random Sample Corpus, we regularly extract the Twitter Current English Lexicon. Basically, we’re 1) pulling all tweets from the last three months of corpus entries that have been marked as “English” by the collection process (we have to make that call because there is no reliable means provided by Twitter), 2) removing all #hash, @at, and http items, 3) breaking the tweets into tokens, 4) building descriptive and summary statistics for all token-based 1-grams and 2-grams, and 5) pushing the top 10,000 N-grams from each set into a database and text files for review. So, for every top 1-gram and 2-gram, you know how many times it occurred in the corpus, and in how many tweets (plus associated percentages).

This is an interesting set of data, particularly when you compare it with a “regular” English corpus, something traditional like the Brown Corpus. Unlike most corpora, the top token (1-gram) for Twitter is “i” (as in me, myself, and I), there are a lot of intentional misspellings, and you find an undue amount of, shall we say, “callous” language (be forewarned). It’s a brave new world if you’re willing.

To use this data set, we recommend using the database version and KwicData, but you can also use the text version. Download the ZIP file you want, unzip it, then read the README file for more explanation about what’s included.
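Steps 2 through 4 of that recipe are easy to picture in code. A compressed sketch, illustrative only and not their collection pipeline:

```python
# Compressed sketch of lexicon steps 2-4: drop #hash, @at, and http
# items, tokenize, then count 1-grams and 2-grams.
from collections import Counter

tweets = ["i love this corpus http://t.co/x", "@bob i love tokens"]

unigrams, bigrams = Counter(), Counter()
for tweet in tweets:
    tokens = [
        w.lower()
        for w in tweet.split()
        if not w.startswith(("#", "@", "http"))
    ]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

# Step 5 keeps the top N of each; 2 stands in for their 10,000.
print(unigrams.most_common(2))
print(bigrams.most_common(2))
```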

I grabbed a copy yesterday but haven’t had the time to look at it.

Twitter feed pipeline software you would recommend?

March 3, 2012

Twitter Streaming with EventMachine and DynamoDB

Filed under: Dynamo,EventMachine,MongoDB,Tweets — Patrick Durusau @ 7:28 pm

Twitter Streaming with EventMachine and DynamoDB

From the post:

This week Amazon Web Services launched their latest database offering ‘DynamoDB’ – a highly-scalable NoSQL database service.

We’ve been using a couple of NoSQL database engines at work for a while now: Redis and MongoDB. Mongo allowed us to simplify many of our data models and represent more faithfully the underlying entities we were trying to represent in our applications, and Redis is used for those projects where we need to make sure that a person only classifies an object once.

Whether you’re using MongoDB or MySQL, scaling the performance and size of a database is non-trivial and is a skillset in itself. DynamoDB is a fully managed database service aiming to offer high-performance data storage and retrieval at any scale, regardless of request traffic or storage requirements. Unusually for Amazon Web Services, they’ve made a lot of noise about some of the underlying technologies behind DynamoDB, in particular they’ve utilised SSD hard drives for storage. I guess telling us this is designed to give us a hint at the performance characteristics we might expect from the service.

A worked example

As with all AWS products there are a number of articles outlining how to get started with DynamoDB. This article is designed to provide an example use case where DynamoDB really shines – parsing a continual stream of data from the Twitter API. We’re going to use the Twitter streaming API to capture tweets and index them by user_id and creation time.
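The post's code is Ruby with EventMachine, but the indexing scheme (user_id as hash key, creation time as range key) reads the same in any language. A sketch of the writes with boto3, assuming a table already exists with those keys:

```python
# Sketch of the post's DynamoDB indexing scheme in Python via boto3.
# Assumes a "tweets" table exists with user_id as the hash key and
# created_at as the range key; not the post's (Ruby) code.
import boto3  # pip install boto3

table = boto3.resource("dynamodb").Table("tweets")

def store_tweet(tweet: dict) -> None:
    table.put_item(
        Item={
            "user_id": str(tweet["user"]["id"]),
            "created_at": tweet["created_at"],  # range key: time order
            "text": tweet["text"],
        }
    )
```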

Wanted to include something a little different after all the graph database and modeling questions. 😉

I need to work on something like this to use Twitter more effectively as an information stream: passing all mentions of graphs and related terms along for further processing, perhaps through a map between Twitter user IDs and known authors. Could be interesting.

February 28, 2012

Map your Twitter Friends

Filed under: Mapping,Maps,Tweets — Patrick Durusau @ 10:44 pm

Map your Twitter Friends by Nathan Yau.

From the post:

You’d think that this would’ve been done by now, but this simple mashup does exactly what the title says. Just connect your Twitter account and the people you follow pop up, with some simple clustering so that people don’t get all smushed together in one location.

Too bad the FBI’s social media mining contract will be secret. I wonder how far beyond freely available capabilities it will go.

Security requirements will drive up the cost. Like secure installations where the computers have R/W DVDs installed.

Not that I carry a brief for any government, other than paying ones, but I do dislike incompetence, on any side.

February 24, 2012

Effective #hashtags

Filed under: Hashtags,Tweets — Patrick Durusau @ 4:53 pm

I ran across http://hashtagify.me/ while trying to be more effective with Twitter posts about topic maps.

If I post on the graph database Neo4j, which hashtag should I use: #neo4j or #Neo4j?

Saying that listeners “need to be educated,” a phrase from #semanticweb (SemanticWeb? semanticWeb?) circles, is a poor strategy if your goal is to communicate.

Speakers should use words and phrases listeners are likely to understand.

For Twitter, that means using the most common hashtags for any given subject.

Otherwise you are sexting with different key combinations than everyone else.

Has to be frustrating. 😉

Oh, the numbers:

Variants: neo4j: 79%, Neo4j: 21%

Variants: semanticweb: 66%, SemanticWeb: 31%, semanticWeb: 3%

With some research you can improve your Twitter communication skills.

February 6, 2012

The anatomy of a Twitter conversation, visualized with R

Filed under: R,Tweets,Visualization — Patrick Durusau @ 7:00 pm

The anatomy of a Twitter conversation, visualized with R by David Smith.

From the post:

If you’re a Twitter user like me, you’re probably familiar with the way that conversations can easily be tracked by following the #hashtag that participants include in their tweets to label the topic. But what causes some topics to take off, and others to die on the vine? Does the use of retweets (copying another user’s tweet to your own followers) have an impact?

I don’t think the visualization answers the questions about why some topics take off while others don’t. Nor, as far as I can tell, does it suggest any conclusion for the retweet question.

To answer the retweet question, the followers of each person would have to be known and their discussion of a retweet measured. Yes?

Still, interesting visualization technique and one that you may find handy in a current or future project.

February 2, 2012

Spot

Filed under: Tweets,Visualization — Patrick Durusau @ 3:39 pm

Spot by Jeff Clark.

From the post:

Spot is an interactive real-time Twitter visualization that uses a particle metaphor to represent tweets. The tweet particles are called spots and get organized in various configurations to illustrate information about the topic of interest.

Spot has an entry field at the lower-left corner where you can type any valid Twitter search query. The latest 200 tweets will be gathered and used for the visualization. Note that Twitter search results only go back about a week so a search for a rare topic may only return a few. When you enter a query the URL is changed so you can easily bookmark it or send it to someone…

The Different Views

Here is a complete list of the views and what they show:

  1. Group View (speech bubble icon) places tweets that share common words inside large circles
  2. Timeline View (watch icon) places tweets along a timeline based on when they were sent
  3. User View (person icon) shows a bar chart with the people sending the most tweets in the set
  4. Word View (word circle icon) directly shows word bubbles with tweets attracted to the words they contain
  5. Source View (megaphone icon) shows a bar chart of the tool used to send the tweets (or sometimes the news source)

What do you like/dislike about the visualization? Is it specific to Twitter or do you see adaptations that could be made for other data sets?

January 21, 2012

Chart of Congressional activity on Twitter related to SOPA/PIPA

Filed under: Graphs,Tweets,Visualization — Patrick Durusau @ 10:12 pm

Chart of Congressional activity on Twitter related to SOPA/PIPA by Drew Conway.

From the post:

As many of you know, this week thousands of people mobilized to protest two laws being considered in Congress: the Stop Online Piracy Act (SOPA) and its Senate version, the PROTECT IP Act (PIPA). Several Internet mainstays, such as Wikipedia, Reddit, and O’Reilly, blacked out their sites to protest the bills. For some information on why this legislation is so dangerous, check out this excellent video by The Guardian.

The mobilization against SOPA/PIPA also included many grassroots efforts to contact Congress and demand the bill be stopped. Given the attention the bill was getting, I was curious if there was any surge in discussion of the bill by members of Congress on Twitter.

So, I created a visualization that is a cumulative timeline of tweets by members of the U.S. Congress for “SOPA” or “PIPA.” To see if there was any surge, check out the visualization for yourself.

Go see Drew’s post and draw the graph for yourself.

OK, but my question would be: who are they tweeting to? We need to distinguish the targets of the tweets from those who actually read them. One possible mechanism would be retweets.

That is, who retweeted messages from a particular member of Congress? It would be interesting to map that to, say, a list of congressional contributors. A different set of identifiers for Twitter versus donation records, but the same subjects.

But Twitter is just surface traffic, and public traffic at that. I assume after the “see my pants” episode last year that most members of Congress are a little more careful with their Twitter accounts. Perhaps not.

What I would be interested in seeing is all the incoming/outgoing phone and other hidden traffic. Like Blackberries. Would not care about the content, just the points of contact. A “pen register” I think they used to call them. Not sure what you would call it for cellphone traffic.

January 11, 2012

Monthly Twitter activity for all members of the U.S. Congress

Filed under: Data Source,Government Data,Tweets — Patrick Durusau @ 8:04 pm

Monthly Twitter activity for all members of the U.S. Congress by Drew Conway.

From the post:

Many months ago I blogged about the research that John Myles White and I are conducting on using Twitter data to estimate an individual’s political ideology. As I mentioned then, we are using the Twitter activity of members of the U.S. Congress to build a training data set for our model. A large part of the effort for this project has gone into designing a system to systematically collect the Twitter data on the members of the U.S. Congress.

Today I am pleased to announce that we have worked out most of the bugs, and now have a reliable data set upon which to build. Better still, we are ready to share. Unlike our old system, the data now lives on a live CouchDB database, and can be queried for specific research tasks. We have combined all of the data available from Twitter’s search API with the information on each member from Sunlight Foundation’s Congressional API.
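CouchDB speaks plain HTTP, so sampling a dataset like this takes only a few lines. In the sketch below the host, database, and view names are hypothetical placeholders, not the project's actual endpoints:

```python
# Query a CouchDB view over HTTP. Host, database, and view names are
# hypothetical placeholders, not the project's real endpoints.
import requests  # pip install requests

BASE = "http://example.org:5984/congress_tweets"  # hypothetical

resp = requests.get(
    f"{BASE}/_design/tweets/_view/by_member",  # hypothetical view
    params={"key": '"some_member"', "limit": 10},  # keys are JSON-encoded
)
resp.raise_for_status()
for row in resp.json()["rows"]:
    print(row["value"])
```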

Looks like an interesting data set to match up to the ages of addresses, doesn’t it?

December 13, 2011

Making Sense of Microposts

Filed under: Conferences,Tweets — Patrick Durusau @ 9:49 pm

Making Sense of Microposts (#MSM2012) – Big things come in small packages

In connection with World Wide Web 2012.

Important dates:

  • Submission of Abstracts (mandatory): 03 Feb 2012
  • Paper Submission deadline: 06 Feb 2012
  • Notification of acceptance: 06 Mar 2012
  • Workshop program issued: 08 Mar 2012
  • Camera-ready deadline: 23 Mar 2012
  • Proceedings published (CEUR): 31 Mar 2012
  • Workshop: 16 Apr 2012 (registration open to all)

(all deadlines 23:59 Hawaii Time)

From the post:

With the appearance and expansion of Twitter, Facebook Like, Foursquare, and similar low-effort publishing services, the effort required to participate on the Web is getting lower and lower. The high-end technology user and developer and the ordinary end user of ubiquitous, personal technology, such as the smart phone, contribute diverse information to the Web as part of informal and semi-formal communication and social activity. We refer to such small user input as ‘microposts’: these range from ‘checkin’ at a location on a geo-social networking platform, through to a status update on a social networking site. Online social media platforms are now very often the portal of choice for the modern technology user accustomed to sharing public-interest information. They are, increasingly, an alternative carrier to traditional media, as seen in their role in the Arab Spring and crises such as the 2011 Japan earthquake. Online social activity has also witnessed the blurring of the lines between private lives and the semi-public online social world, opening a new window into the analysis of human behaviour, implicit knowledge, and adaptation to and adoption of technology.

The challenge of developing novel methods for processing the enormous streams of heterogeneous, disparate micropost data in intelligent ways and producing valuable outputs, that may be used on a wide variety of devices and end uses, is more important than ever before. Google+ is one of the better-known new services, whose aim is to bootstrap microposts in order to more effectively tailor search results to a user’s social graph and profile.

This workshop will examine, broadly:

  • information extraction and leveraging of semantics from microposts, with a focus on novel methods for handling the particular challenges due to enforced brevity of expression;
  • making use of the collective knowledge encoded in microposts’ semantics in innovative ways;
  • social and enterprise studies that guide the design of appealing and usable new systems based on this type of data, by leveraging Semantic Web technologies.

This workshop is unique in its interdisciplinary nature, targeting both Computer Science and the Social Sciences, to help also to break down the barriers to optimal use of Semantic Web data and technologies. The workshop will focus on both the computational means to handle microposts and the study of microposts, in order to identify the motivational aspects that drive the creation and consumption of such data.

Is tailoring of search results to “…a user’s social graph and profile” a good or bad thing? We all exist in self-imposed mono-cultures in which “other” viewpoints are allowed in carefully measured amounts. How would you gauge what we are missing?

November 25, 2011

SpiderDuck: Twitter’s Real-time URL Fetcher

Filed under: Software,Topic Map Software,Topic Map Systems,Tweets — Patrick Durusau @ 4:26 pm

SpiderDuck: Twitter’s Real-time URL Fetcher

A bit of a walk on the engineering side, but to be relevant, topic maps do have to be written and topic map software implemented.

This is a very interesting write-up of how Twitter relied mostly on open source tools to create a system that could be very relevant to topic map implementations.

For example, the fetch/no-fetch decision for URLs is based on a comparison to URLs fetched within X days. Hmmm, comparison of URLs, oh, those things that occur in subjectIdentifier and subjectLocator properties of topics. Do you smell relevance?
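The freshness test itself is almost trivial to state. A minimal sketch of the idea (not SpiderDuck's implementation), with the "X days" window as an assumption:

```python
# Minimal fetch/no-fetch decision: skip a URL if it was fetched within
# the last X days. The idea only, not SpiderDuck's implementation.
from datetime import datetime, timedelta

FETCH_WINDOW = timedelta(days=7)  # "X days"; the real value is tuned
last_fetched = {}  # url -> datetime of last fetch

def should_fetch(url: str, now: datetime) -> bool:
    seen = last_fetched.get(url)
    if seen is not None and now - seen < FETCH_WINDOW:
        return False  # fetched recently enough, skip
    last_fetched[url] = now
    return True
```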

And there is harvesting of information from web pages, one assumes that could be done on “information items” from a topic map as well, except there it would be properties, etc. Even more relevance.

What parts of SpiderDuck do you find most relevant to a topic map implementation?

November 14, 2011

Twitter POS Tagging with LingPipe and ARK Tweet Data

Filed under: LingPipe,Linguistics,POS,Tweets — Patrick Durusau @ 7:15 pm

Twitter POS Tagging with LingPipe and ARK Tweet Data by Bob Carpenter.

From the post:

We will train and test on anything that’s easy to parse. Up today is a basic English part-of-speech tagging for Twitter developed by Kevin Gimpel et al. (and when I say “et al.”, there are ten co-authors!) in Noah Smith’s group at Carnegie Mellon.

We will train and test on anything that’s easy to parse.

How’s that for a motto! 😉
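LingPipe is a Java toolkit, so Bob's code won't paste in here. For a quick sense of what tweet POS tagging produces, a rough stand-in using NLTK's default newswire-trained tagger (exactly the mismatch that makes ARK's tweet data valuable):

```python
# Quick stand-in for tweet POS tagging using NLTK's default tagger.
# NLTK is trained on newswire, not tweets, which is why tweet-specific
# training data like ARK's exists. Not Bob's (LingPipe) code.
import nltk  # pip install nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tweet = "ikr smh he asked fir yo real name"
print(nltk.pos_tag(nltk.word_tokenize(tweet)))
# A newswire tagger stumbles over "ikr" and "smh"; a tweet-trained
# tagger labels them as interjections.
```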

Social media may be more important than I thought it was several years ago. It may just be the serialization in digital form of all the banter in bars, at block parties, and around the water cooler. If that is true, then governments would be well advised to encourage and assist with access to social media. To give them an even chance of leaving ahead of the widow maker.

Think of mining Twitter data like the NSA and phone traffic, but you aren’t doing anything illegal.

November 4, 2011

More Data: Tweets & News Articles

Filed under: Dataset,News,TREC,Tweets — Patrick Durusau @ 6:07 pm

From Max Lin’s blog, Ian Soboroff posted:

Two new collections being released from TREC today:

The first is the long-awaited Tweets2011 collection. This is 16 million tweets sampled by Twitter for use in the TREC 2011 microblog track. We distribute the tweet identifiers and a crawler, and you download the actual tweets using the crawler. http://trec.nist.gov/data/tweets/

The second is TRC2, a collection of 1.8 million news articles from Thomson Reuters used in the TREC 2010 blog track. http://trec.nist.gov/data/reuters/reuters.html

Both collections are available under usage agreements that limit their use to research and forbid redistribution, but otherwise are very open as data usage agreements go.

It may just be my memory but I don’t recall seeing topic map research with the older Reuters data set (the new one is too recent). Is that true?

Anyway, more large data sets for your research pleasure.
