Archive for the ‘Tweets’ Category

Automated Archival and Visual Analysis of Tweets…

Thursday, May 16th, 2013

Automated Archival and Visual Analysis of Tweets Mentioning #bog13, Bioinformatics, #rstats, and Others by Stephen Turner.

From the post:

Ever since Twitter gamed its own API and killed off great services like IFTTT triggers, I’ve been looking for a way to automatically archive tweets containing certain search terms of interest to me. Twitter’s built-in search is limited, and I wanted to archive interesting tweets for future reference and to start playing around with some basic text / trend analysis.

Enter t – the twitter command-line interface. t is a command-line power tool for doing all sorts of powerful Twitter queries using the command line. See t‘s documentation for examples.

I wrote this script that uses the t utility to search Twitter separately for a set of specified keywords, and append those results to a file. The comments at the end of the script also show you how to commit changes to a git repository, push to GitHub, and automate the entire process to run twice a day with a cron job. Here’s the code as of May 14, 2013:

Stephen promises in his post that the script updates automatically and you may find “unsavory” tweets.

I didn’t but that may be a matter of happenstance or sensitivity. ;-)

Analyzing Twitter: An End-to-End Data Pipeline Recap

Monday, May 13th, 2013

Analyzing Twitter: An End-to-End Data Pipeline Recap by Jason Barbour.

Jason reviews presentations at a recent Data Science MD meeting:

Starting off the night, Joey Echeverria, a Principal Solutions Architect, first discussed a big data architecture and how key components of a relational data management system can be replaced with current big data technologies. With Twitter being increasingly popular with marketing teams, analyzing Twitter data becomes a perfect use case to demonstrate a complete big data pipeline.

(…)

Following Joey, Sean Busbey, a Solutions Architect at Cloudera, discussed working with Mahout, a scalable machine learning library for Hadoop. Sean first introduced the three C’s of machine learning: classification, clustering, and collaborative filtering. With classification, learning from a training set is supervised, and new examples can be categorized. Clustering allows examples to be grouped together with common features, while collaborative filtering allows new candidates to be suggested.

Great summaries, links to additional resources and the complete slides.

Check the DC Data Community Events Calendar if you plan to visit the DC area. (I assume residents already do.)

Finding Significant Phrases in Tweets with NLTK

Sunday, May 12th, 2013

Finding Significant Phrases in Tweets with NLTK by Sujit Pal.

From the post:

Earlier this week, there was a question about finding significant phrases in text on the Natural Language Processing People (login required) group on LinkedIn. I suggested looking at this LingPipe tutorial. The idea is to find statistically significant word collocations, ie, those that occur more frequently than we can explain away as due to chance. I first became aware of this approach from the LLG Book, where two approaches are described – one based on Log-Likelihood Ratios (LLR) and one based on the Chi-Squared test of independence – the latter is used by LingPipe.

I had originally set out to actually provide an implementation for my suggestion (to answer a followup question). However, the Scipy Pydoc notes that the chi-squared test may be invalid when the number of observed or expected frequencies in each category are too small. Our algorithm compares just two observed and expected frequencies, so it probably qualifies. Hence I went with the LLR approach, even though it is slightly more involved.

The idea is to find, for each bigram pair, the likelihood that the components are dependent on each other versus the likelihood that they are not. For bigrams which have a positive LLR, we repeat the analysis by adding its neighbor word, and arrive at a list of trigrams with positive LLR, and so on, until we reach the N-gram level we think makes sense for the corpus. You can find an explanation of the math in one of my earlier posts, but you will probably find a better explanation in the LLG book.

For input data, I decided to use Twitter. I’m not that familiar with the Twitter API, but I’m taking the Introduction to Data Science course on Coursera, and the first assignment provided some code to pull data from the Twitter 1% feed, so I just reused that. I preprocess the feed so I am left with about 65k English tweets using the following code:

An interesting look “behind the glass” on n-grams.

I am using AntConc to generate n-grams for proofing standards prose.

But as a finished tool, AntConc doesn’t give you insight into the technical side of the process.
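For a look at the technical side, here is a minimal sketch of a log-likelihood ratio score for a single bigram. It uses the common binomial formulation (Dunning's statistic), which may differ in detail from the computation in Sujit's post. In particular, this version returns -2 log(lambda), which is never negative, so you would compare it against a threshold rather than test for a positive sign. The function names and the count arguments are my own, chosen for illustration.

```python
import math


def _log_l(k, n, x):
    """Log binomial likelihood of k successes in n trials (constant term dropped)."""
    # Clamp x away from 0 and 1 so the logarithms stay finite.
    x = min(max(x, 1e-12), 1 - 1e-12)
    return k * math.log(x) + (n - k) * math.log(1 - x)


def bigram_llr(c12, c1, c2, n):
    """Return -2 * log(lambda) for the bigram (w1, w2).

    c12: count of "w1 w2"; c1: count of w1; c2: count of w2; n: total tokens.
    Requires 0 < c1 < n. Larger values mean stronger evidence that w2
    depends on w1.
    """
    p = c2 / n                      # P(w2) if w1 and w2 are independent
    p1 = c12 / c1                   # P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)      # P(w2 | not w1)
    log_lambda = (
        _log_l(c12, c1, p) + _log_l(c2 - c12, n - c1, p)
        - _log_l(c12, c1, p1) - _log_l(c2 - c12, n - c1, p2)
    )
    return -2.0 * log_lambda
```

When the bigram occurs exactly as often as chance predicts, p, p1 and p2 coincide and the score is zero. Bigrams that clear your chosen threshold would then be extended by a neighbor word and scored again as trigrams.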

Welcome to TweetMap ALPHA

Wednesday, April 24th, 2013

Welcome to TweetMap ALPHA

From the introduction popup:

TweetMap is an instance of MapD, a massively parallel database platform being developed through a collaboration between Todd Mostak, (currently a researcher at MIT), and the Harvard Center for Geographic Analysis (CGA).

The tweet database presented here starts on 12/10/2012 and ends 12/31/2012. Currently 95 million tweets are available to be queried by time, space, and keyword. This could increase to billions and we are working on real time streaming from tweet-tweeted to tweet-on-the-map in under a second.

MapD is a general purpose SQL database that can be used to provide real-time visualization and analysis of just about any very large data set. MapD makes use of commodity Graphic Processing Units (GPUs) to parallelize hard compute jobs such as that of querying and rendering very large data sets on-the-fly.

This is a real treat!

Try something popular, like “gaga,” without the quotes.

Remember this is running against 95 million tweets.

Impressive! Yes?

Meet @InfoVis_Ebooks, …

Tuesday, April 23rd, 2013

Meet @InfoVis_Ebooks, Your Source for Random InfoVis Paper Snippets by Robert Kosara.

From the post:

InfoVis Ebooks takes a random piece of text from a random paper in its repository and tweets it. It has read all of last year’s InfoVis papers, and is now getting started with the VAST proceedings. After that, it will start reading infovis papers published in last year’s EuroVis and CHI conferences, and then work its way back to previous years.

Each tweet contains a reference to the paper the snippet is from. For InfoVis, VAST, and CHI, these are DOIs rather than links. Links get long and distracting, whereas DOIs are much easier to tune out in a tweet. If you want to see the paper, google the DOI string (keep the “doi:” part). You can also take everything but the “doi:” and append it to http://dx.doi.org/ to be redirected to the paper page. For other sources, I will probably have to use links.

As the name suggests, InfoVis Ebooks is about infovis papers. If you want to do the same for SciVis, HCI, or anything else, the code is available on github.

When I first saw this, I thought it would be a source of spam.

But it lingered on a browser tab for a day or so and when I looked back at it, I started to get interested.

Not that this would help a machine, but for human readers, seeing the right snippet at the right time could lead to a good (or bad) idea.

Can’t tell which one in advance but seems like it would be worth the risk.

Perhaps we can’t guarantee serendipity but we can create conditions where it is more likely to happen.

Yes?

PS: If you start one of these feeds, let me know so I can point to it.
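An aside on the DOI instructions quoted above: the "take everything but the doi: and append it" step is easy to script. This is a minimal sketch; the function name is mine and the DOI in the usage note is a made-up placeholder, not a real paper.

```python
def doi_to_url(reference, resolver="http://dx.doi.org/"):
    """Turn a 'doi:...' string from a tweet into a resolver URL."""
    ref = reference.strip()
    # Drop the leading "doi:" label if present, as the post describes.
    if ref.lower().startswith("doi:"):
        ref = ref[4:].strip()
    return resolver + ref
```

For example, doi_to_url("doi:10.1000/example") builds a URL by appending the placeholder DOI to the resolver address.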

A walk-through for the Twitter streaming API

Sunday, April 14th, 2013

A walk-through for the Twitter streaming API by Jason Baldridge.

From the post:

Analyzing tweets is all the rage, and if you are new to the game you want to know how to get them programmatically. There are many ways to do this, but a great start is to use the Twitter streaming API, a RESTful service that allows you to pull tweets in real time based on criteria you specify. For most people, this will mean having access to the spritzer, which provides only a very small percentage of all the tweets going through Twitter at any given moment. For access to more, you need to have a special relationship with Twitter or pay Twitter or an affiliate like Gnip.

This post provides a basic walk-through for using the Twitter streaming API. You can get all of this based on the documentation provided by Twitter, but this will be slightly easier going for those new to such services. (This post is mainly geared for the first phase of the course project for students in my Applied Natural Language Processing class this semester.)

You need to have a Twitter account to do this walk-through, so obtain one now if you don’t have one already.

Basics of obtaining tweets from the Twitter stream.

I mention it as an active data source that may find its way into your topic map.
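If you save the stream to a file first, picking out the tweets you care about is a small job. The sketch below assumes the saved stream is line-delimited JSON, with one message per line, blank keep-alive lines, and tweet text in a "text" field. It deliberately does not connect to Twitter; authentication and the connection itself are covered in Jason's walk-through. The function name is hypothetical.

```python
import json


def matching_tweets(lines, keywords):
    """Yield decoded tweets whose text mentions any of the keywords."""
    wanted = [k.lower() for k in keywords]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank keep-alive line
        try:
            message = json.loads(line)
        except ValueError:
            continue  # truncated or malformed line
        if not isinstance(message, dict):
            continue
        text = message.get("text")
        # Messages without text (delete notices, for example) are skipped.
        if text and any(k in text.lower() for k in wanted):
            yield message
```

Because it is a generator, it can be pointed at an open file and will process one line at a time.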

Improving Twitter search with real-time human computation ["semantics supplied"]

Tuesday, April 9th, 2013

Improving Twitter search with real-time human computation by Edwin Chen.

From the post:

Before we delve into the details, here’s an overview of how the system works.

(1) First, we monitor for which search queries are currently popular.

Behind the scenes: we run a Storm topology that tracks statistics on search queries.

For example: the query “Big Bird” may be averaging zero searches a day, but at 6pm on October 3, we suddenly see a spike in searches from the US.

(2) Next, as soon as we discover a new popular search query, we send it to our human evaluation systems, where judges are asked a variety of questions about the query.

Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to Amazon’s Mechanical Turk service, and then polls Mechanical Turk for a response.

For example: as soon as we notice “Big Bird” spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant tweets and ads.

Finally, after a response from a judge is received, we push the information to our backend systems, so that the next time a user searches for a query, our machine learning models will make use of the additional information. For example, suppose our human judges tell us that “Big Bird” is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.

Let’s now explore the first two sections above in more detail.

….
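Step (1) above, noticing that a query has suddenly become popular, can be illustrated with a toy spike detector. This is only a sketch of the general idea, not Twitter's Storm topology. The class name, the window of past counts, and the thresholds are all assumptions picked for illustration.

```python
from collections import defaultdict, deque


class SpikeDetector:
    """Flag a query when its count jumps well above its recent average."""

    def __init__(self, window=24, factor=5.0, min_count=50):
        self.factor = factor        # how many times the baseline counts as a spike
        self.min_count = min_count  # ignore tiny absolute counts
        # Keep only the last `window` counts per query.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, query, count):
        """Record the latest count for a query and report whether it spikes."""
        past = self.history[query]
        baseline = sum(past) / len(past) if past else 0.0
        is_spike = (
            count >= self.min_count
            and count > self.factor * max(baseline, 1.0)
        )
        past.append(count)
        return is_spike
```

A query such as "Big Bird" that averages zero searches and then jumps to several hundred in one interval would be flagged; in the system described above, that is the point at which the query would be handed to human judges.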

The post is quite awesome and I suggest you read it in full.

This resonates with a recent comment about Lotus Agenda.

The short version is a user creates a thesaurus in Agenda that enables searches enriched by the thesaurus. The user supplied semantics to enhance the searches.

In the Twitter case, human reviewers supply semantics to enhance the searches.

In both cases, Agenda and Twitter, humans are supplying semantics to enhance the searches.

I emphasize “supplying semantics” as a contrast to mechanistic searches that rely on text.

Mechanistic searches can be quite valuable but they pale beside searches where semantics have been “supplied.”

The Twitter experience is an important clue.

The answer to semantics for searches lies somewhere between ask an expert (you get his/her semantics) and ask all of us (too many answers to be useful).

More to follow.

Analyzing Twitter Data with Apache Hadoop, Part 3:…

Tuesday, March 26th, 2013

Analyzing Twitter Data with Apache Hadoop, Part 3: Querying Semi-structured Data with Apache Hive by Jon Natkins.

From the post:

This is the third article in a series about analyzing Twitter data using some of the components of the Apache Hadoop ecosystem that are available in CDH (Cloudera’s open-source distribution of Apache Hadoop and related projects). If you’re looking for an introduction to the application and a high-level view, check out the first article in the series.

In the previous article in this series, we saw how Flume can be utilized to ingest data into Hadoop. However, that data is useless without some way to analyze the data. Personally, I come from the relational world, and SQL is a language that I speak fluently. Apache Hive provides an interface that allows users to easily access data in Hadoop via SQL. Hive compiles SQL statements into MapReduce jobs, and then executes them across a Hadoop cluster.

In this article, we’ll learn more about Hive, its strengths and weaknesses, and why Hive is the right choice for analyzing tweets in this application.

I didn’t realize I had missed this part of the Hive series until I saw it mentioned in the Hue post.

Good introduction to Hive.
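To make "Hive compiles SQL statements into MapReduce jobs" a little more concrete, here is a toy sketch of how a query along the lines of SELECT user, COUNT(*) FROM tweets GROUP BY user breaks into map, shuffle and reduce phases. It illustrates the idea only and is not Hive's actual execution plan; the "user" column and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(rows):
    """Mapper: emit (grouping key, 1) for every input row."""
    for row in rows:
        yield row["user"], 1


def shuffle(pairs):
    """Shuffle: bring all values for the same key together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reducer: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_user(rows):
    """Run the three phases in sequence on an in-memory list of rows."""
    return reduce_phase(shuffle(map_phase(rows)))
```

On a real cluster the mappers and reducers run on different machines and the shuffle moves data across the network, which is one reason latency is a weak spot for this approach.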

BTW, is Twitter data becoming the “hello world” of data mining?

How-to: Analyze Twitter Data with Hue

Tuesday, March 26th, 2013

How-to: Analyze Twitter Data with Hue by Romain Rigaux.

From the post:

Hue 2.2, the open source web-based interface that makes Apache Hadoop easier to use, lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features different applications like an Apache Hive editor and Apache Oozie dashboard and workflow builder.

This post is based on our “Analyzing Twitter Data with Hadoop” sample app and details how the same results can be achieved through Hue in a simpler way. Moreover, all the code and examples of the previous series have been updated to the recent CDH4.2 release.

The Hadoop ecosystem continues to improve!

Question: Is anyone keeping a current listing/map of the various components in the Hadoop ecosystem?

HOWTO use Hive to SQLize your own Tweets…

Monday, March 25th, 2013

HOWTO use Hive to SQLize your own Tweets – Part One: ETL and Schema Discovery by Russell Jurney.

HOWTO use Hive to SQLize your own Tweets – Part Two: Loading Hive, SQL Queries

Russell walks you through extracting your tweets, discovering their schema, loading them into Hive and querying the result.

I just requested my tweets on Friday so expect to see them tomorrow or Tuesday.

Will be a bit more complicated than Russell’s example because I re-post tweets about older posts on my blog.

I will have to delete those, although I may want to know when a particular tweet appeared, which means I will need to capture the date(s) when a particular tweet appeared.
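A rough sketch of what I have in mind: collapse re-posted tweets into one entry each while keeping every date on which the tweet appeared. The "text" and "created_at" keys are assumptions about how the archive will be loaded, not a documented format, and the function name is hypothetical.

```python
def collapse_reposts(tweets):
    """Merge tweets with identical text, keeping all of their dates."""
    grouped = {}
    for tweet in tweets:
        # Identical text is treated as the same tweet posted again.
        grouped.setdefault(tweet["text"], []).append(tweet["created_at"])
    return [
        {"text": text, "dates": sorted(dates)}
        for text, dates in grouped.items()
    ]
```

Sorting the dates as strings works only if they are stored in a sortable form such as ISO 8601; other date formats would need parsing first.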

BTW, if you do obtain your tweet archive, consider donating it to #Tweets4Science.

#Tweets4Science

Thursday, March 14th, 2013

#Tweets4Science

From the manifesto:

User generated content has experienced an explosive growth both in the diversity of available services and the volume of topics covered by the users. Content published in micro-blogging sites such as Twitter is a rich, heterogeneous, and, above all, huge sample of the daily musings of our fellow citizens across the world.

Once qualified as inane chatter, more and more researchers are turning to Twitter data to better understand our social behavior and, no doubt, that global chatter will provide a first person account of our times to future historians.

Thus, initiatives such as the one led by the Library of the US Congress to collect the entire Twitter Archive are laudable. However, as of today, no researcher has been granted access to that archive, there is no estimation on when such access would be possible and, on top of that, access would only be granted on site.

As researchers we understand the legal compromises one must reach with the private sector, and we understand that it is fair that Twitter and resellers offer access to Twitter data, including historical data, for a fee (a rather large one, by the way). However, without the data provided by each of Twitter’s users such a business would be impossible and, hence, we believe that such data belongs to the users individually and as a group.

Includes links on how to download and donate your tweets.

The researchers appeal to altruism: aggregating your tweets with others may advance human knowledge.

I have a much more pragmatic reason:

While I trust the Library of Congress, I don’t trust their pay masters.

Not to sound paranoid, but the delay in anyone accessing the Twitter data at the Library of Congress seems odd. The astronomy community was providing access to much larger data sets long before the first tweet.

So why is it taking so long?

While we are waiting on multiple versions of that story, download your tweets and donate them to this project.

Programming Isn’t Math

Sunday, March 10th, 2013

Programming Isn’t Math by Oscar Boykin.

From the description:

Functional programming has a rich history of drawing from mathematical theory, yet in this highly entertaining talk from the Northeast Scala Symposium, Twitter data scientist Oscar Boykin makes the case that programming is distinct from mathematics. This distinction is healthy and does not mean we can’t leverage many results and concepts from mathematics.

As examples, Oscar will discuss some recent work — algebird, bijection, scalding — and show cases where mathematical purity was both helpful and harmful to developing products at Twitter.

The phrase “…highly entertaining…” may be an understatement.

The type of presentation where you want to start reading new material during the presentation but you are afraid of missing the next gold nugget!

Definitely one to start the week on!

ViralSearch: How Viral Content Spreads over Twitter

Wednesday, March 6th, 2013

ViralSearch: How Viral Content Spreads over Twitter by Andrew Vande Moere.

From the post:

ViralSearch [microsoft.com], developed by Jake Hofman and others of Microsoft Research, visualizes how content spreads over social media, and Twitter in particular.

ViralSearch is based on hundreds of thousands of stories that are spread through billions of mentions of these stories, over many generations. In particular, it reveals the typical, hidden structures behind the sharing of viral videos, photos and posts as a hierarchical generation tree or as an animated bubble graph. The interface contains an interactive timeline of events, as well as a search field to explore specific phrases, stories, or Twitter users to provide an overview of how the independent actions of many individuals make content go viral.

As this tool seems only to be available within Microsoft, you can only enjoy it by watching the documentary video below.

See also NYTLabs Cascade: How Information Propagates through Social Media for a visualization of a very similar concept.

Impressive graphics!

Question: If and when you have an insight while viewing a social networking graphic, where do you capture that insight?

That is, how do you link your insight to a particular point in the graphic?

Download all your tweets [Are You An Outlier/Drone Target?]

Sunday, February 17th, 2013

Download all your tweets by Ajay Ohri.

From the post:

Now that the Government of the United States of America has the legal power to request your information without a warrant (The Chinese love this!)

Anyways- you can also download your own twitter data. Liberate your data.

Have you looked at your own data? Go there at https://twitter.com/settings/account and review the changes.

Modern governments invent evidence out of whole cloth, enough to topple other governments, so whether my communications are secure or not may be a moot point.

It may make a difference whether your communications stand out enough that they focus on inventing evidence about you.

In that case, having all your tweets, particularly with the tweets of others, could be a useful thing.

With enough data a profile could be constructed so that your tweets come within ± some percentage of the normal tweets for your demographic.

I don’t ever tweet about American Idol (#idol) so I am already an outlier. ;-)

Mapping the demographics to content and hash tags, along with dates, events, etc. would make for a nice graph/topic map type application.

Perhaps a deviation warning system if your tweets started to curve away from the pack.
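A toy sketch of such a warning: compare the hashtag mix in your tweets with the mix for your demographic and raise a flag when the two drift too far apart. The distance measure used here (Jensen-Shannon divergence, which runs from 0 for identical mixes to 1 for completely different ones), the threshold, and the function names are all my assumptions; a real profile would need far more than hashtags.

```python
import math
from collections import Counter


def _normalize(counts):
    """Convert raw counts to relative frequencies."""
    total = sum(counts.values())
    return {tag: c / total for tag, c in counts.items()} if total else {}


def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence (base 2) between two count tables."""
    p, q = _normalize(p_counts), _normalize(q_counts)
    div = 0.0
    for tag in set(p) | set(q):
        pi, qi = p.get(tag, 0.0), q.get(tag, 0.0)
        mi = (pi + qi) / 2
        if pi:
            div += 0.5 * pi * math.log2(pi / mi)
        if qi:
            div += 0.5 * qi * math.log2(qi / mi)
    return div


def deviation_warning(user_tags, baseline_tags, threshold=0.5):
    """True when the user's hashtag mix strays too far from the baseline."""
    return js_divergence(Counter(user_tags), Counter(baseline_tags)) > threshold
```

Someone who never tweets a tag that dominates the baseline, #idol for instance, would score high on this measure.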

Hiding from data mining isn’t an option.

The question is how to hide in plain sight?

Building the Library of Twitter

Saturday, January 19th, 2013

Building the Library of Twitter by Ian Armas Foster.

From the post:

On an average day people around the globe contribute 500 million messages to Twitter. Collecting and storing every single tweet and its resulting metadata from a single day would be a daunting task in and of itself.

The Library of Congress is trying something slightly more ambitious than that: storing and indexing every tweet ever posted.

With the help of social media facilitator Gnip, the Library of Congress aims to create an archive where researchers can access any tweet recorded since Twitter’s inception in 2006.

According to this update on the progress of the seemingly herculean project, the LOC has already archived 170 billion tweets and their respective metadata. That total includes the posts from 2006-2010, which Gnip compressed and sent to the LOC over three different files of 2.3 terabytes each. When the LOC uncompressed the files, they filled 20 terabytes’ worth of server space representing 21 billion tweets and its supplementary 50 metadata fields.

It is often said that 90% of the world’s data has accrued over the last two years. That is remarkably close to the truth for Twitter, as an additional 150 billion tweets (88% of the total) poured into the LOC archive in 2011 and 2012. Further, Gnip delivers hourly updates to the tune of half a billion tweets a day. That means 42 days’ worth of 2012-2013 tweets equal the total amount from 2006-2010. In all, they are dealing with 133.2 terabytes of information.
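The figures in that excerpt can be checked with a few lines of arithmetic. This sketch uses only the numbers quoted above; the one assumption is that a terabyte means 10^12 bytes. Note that 170 billion minus 150 billion leaves 20 billion rather than the quoted 21 billion, so the published totals are evidently rounded.

```python
BILLION = 10 ** 9

total_archived = 170 * BILLION     # tweets archived so far
added_2011_2012 = 150 * BILLION    # tweets added in 2011 and 2012
tweets_2006_2010 = 21 * BILLION    # tweets in the first delivery
daily_rate = BILLION // 2          # half a billion tweets a day

# Share of the archive that arrived in 2011 and 2012.
share_recent = added_2011_2012 / total_archived

# Days of current traffic needed to equal all of 2006-2010.
days_to_match = tweets_2006_2010 // daily_rate

# Average uncompressed size per tweet, metadata included,
# assuming 1 terabyte = 10**12 bytes.
bytes_per_tweet = 20 * 10 ** 12 / tweets_2006_2010

print(round(share_recent * 100), days_to_match, round(bytes_per_tweet))
```

The share comes out at about 88 percent and the day count at 42, matching the article, and each tweet with its metadata averages a little under a kilobyte uncompressed.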

Now there’s a big data problem for you! Not to mention a resource problem for the Library of Congress.

You might want to make a contribution to help fund their work on this project.

The full archive is obviously of incredible value for researchers at all levels, but smaller subsets of the Twitter stream may be valuable as well.

If I were designing a Twitter-based lexicon for covert communication, for example, I would want to use frequent terms from particular geographic locations.

And/or create patterns of tweets from particular accounts so that they don’t stand out from others.

Not to mention trying to crunch the Twitter stream for content I know must be present.

Want some hackathon friendly altmetrics data?…

Wednesday, December 26th, 2012

Want some hackathon friendly altmetrics data? arXiv tweets dataset now up on figshare by Euan Adie.

From the post:

The dataset contains details of approximately 57k tweets linking to arXiv papers, found between 1st January and 1st October this year. You’ll need to supplement it with data from the arXiv API if you need metadata about the preprints linked to. The dataset does contain follower counts and lat/lng pairs for users where possible, which could be interesting to plot.

Euan has some suggested research directions and more details on the data set.

Something to play with during the holiday “down time.” ;-)

I first saw this in a tweet by Jason Priem.

Analyzing Big Data With Twitter

Friday, December 14th, 2012

UC Berkeley Course Lectures: Analyzing Big Data With Twitter by Marti Hearst.

Marti gives a summary of this excellent class, with links to videos, slides and high level notes for the course.

If you enjoyed these materials, make a post about them, recommend them to others or even send Marti a note of appreciation.

Prof. Marti Hearst, hearst@ischool.berkeley.edu

Analyzing the Twitter Conversation and Interest Graphs

Monday, November 26th, 2012

Analyzing the Twitter Conversation and Interest Graphs by Marti Hearst.

From the post:

For assignment 3, students analyzed and compared a portion of the Twitter “conversation graph” and the “interest graph”. Conversations were found by looking for Twitter “@mentions” and interest graph by looking at the friend/follow graphs for a user (finding friends of friends, taking a k-core analysis, and closing the triangles). The attached document highlights many of the students’ work.

One of the most impressive graphs was made by Achal Soni. He used Java and the Twitter4J library to obtain 3000 tweets for 4 rappers (Drake, Kendrick Lamar, J Cole, and Big Sean). He extracted @mentions from these tweets, and created a graph recording edges between the celebrities and who they were conversing with.

A clever choice of colors makes this network representation work very well.
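The core of the conversation graph, pulling @mentions out of tweets and recording who talks to whom, fits in a few lines. This is a sketch in Python rather than the Java and Twitter4J used by the student, and the function name and input shape (pairs of author and tweet text) are assumptions made for illustration.

```python
import re
from collections import Counter

# A mention is "@" followed by letters, digits or underscores.
MENTION_RE = re.compile(r"@(\w+)")


def mention_edges(tweets):
    """Count directed edges (author, mentioned user) from (author, text) pairs."""
    edges = Counter()
    for author, text in tweets:
        for mentioned in MENTION_RE.findall(text):
            # Skip self-mentions so they do not create loops in the graph.
            if mentioned.lower() != author.lower():
                edges[(author, mentioned)] += 1
    return edges
```

The counts can serve as edge weights when the graph is drawn. A simple pattern like this one will also match e-mail addresses that appear in tweet text, so real data may need a stricter rule.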

Spark: Making Big Data Analytics Interactive and Real-Time (Video Lecture)

Wednesday, November 14th, 2012

Spark: Making Big Data Analytics Interactive and Real-Time by Matei Zaharia. Post from Marti Hearst.

From the post:

Spark is the hot next thing for Hadoop / MapReduce, and yesterday Matei Zaharia, a PhD student in UC Berkeley’s AMP Lab, gave us a terrific lecture about how it works and what’s coming next. The key idea is to make analysis of big data interactive and able to respond in real time. Matei also gave a live demo.

(slides here)

Spark: Lightning-Fast Cluster Computing (website).

Another great lecture from Marti’s class on Twitter and Big Data.

Managing Conference Hashtags

Tuesday, November 13th, 2012

David Karger tweets today:

Ironically amusing that ontology researchers can’t manage to agree on a canonical tag for their conference #iswc #iswc12 #iswc2012

If that’s true for ontology researchers, what chance does the rest of the world have?

Just to help ontology researchers along a bit (in LTM syntax):

*****

/* typing topics */

[conf = "conference"]

/* scoping topics */

[SWTwitter01 : conf = "Semantic Web, Twitter hashtag 01."]

[SWTwitter02 : conf = "Semantic Web, Twitter hashtag 02."]

[SWTwitter03 : conf = "Semantic Web, Twitter hashtag 03."]

[iswc2012 : conf = "ISWC 2012, The 11th International Semantic Web Conference"
("#iswc" / SWTwitter01)
("#iswc12" / SWTwitter02)
("#iswc2012" / SWTwitter03)]

*****

I added the “conf” typing topic to the scoping topics to distinguish those tags from others for:

ISWC (International Standard Musical Work Code)

Welcome to ISWC 2013! The International Symposium on Wearable Computers (ISWC)

Wikipedia – ISWC, also lists:

International Speed Windsurfing Class

But missed:

International Student Welcome Committee

There remains the task of distinguishing tags in the wild from tags for these other subjects.

Once that is done, all the tweets about the conference, under these or other tags, can be collocated for a full set of tweets about the conference.

Other subjects and relationships, such as person, date, location, topic, tags, retweets, etc., can be just as easily added.
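Outside of a topic map engine, the collocation step can be sketched as a simple lookup from tag to subject. The mapping below mirrors the three hashtags above and assumes the disambiguation from the other ISWCs has already been done; the function name and the subject identifier are used here purely for illustration.

```python
import re

# Hashtags already judged to refer to the conference (see the caveat above).
TAG_TO_SUBJECT = {
    "#iswc": "iswc2012",
    "#iswc12": "iswc2012",
    "#iswc2012": "iswc2012",
}

HASHTAG_RE = re.compile(r"#\w+")


def collocate(tweets, tag_to_subject=TAG_TO_SUBJECT):
    """Group tweet texts under the subject their hashtags identify."""
    by_subject = {}
    for tweet in tweets:
        # A set ensures a tweet with two tags for one subject is listed once.
        subjects = {
            tag_to_subject[tag.lower()]
            for tag in HASHTAG_RE.findall(tweet)
            if tag.lower() in tag_to_subject
        }
        for subject in subjects:
            by_subject.setdefault(subject, []).append(tweet)
    return by_subject
```

Tweets using any of the three tags end up in one list, while tweets with unrelated tags are left out.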


Personally I would make the default sort order for tweets a function of date/time, quite possibly mis-using sortname for that purpose. People are accustomed to seeing tweets in time order, and fancy collocation can wait until they select an author, subject, tag, etc.

Mapping Racist Tweets

Tuesday, November 13th, 2012

Where America’s Racist Tweets Come From by Megan Garber.

WARNING: The cited article has very racist and offensive tweets. They are reproduced to illustrate the technique, not to promote racism.

Megan reports on the work of Floating Sheep, geography academics.

The surprising thing about the geographic distribution (it’s pretty much all over the nation) is the lack of racist tweets from Montana, where all the survivalist types have bunkered up.

Then I remembered, they don’t have Internet access in log and dirt bunkers. Probably no electricity or running water as well. Some politics are their own reward. ;-)

You may also appreciate the longer original post at Floating Sheep: Mapping Racist Tweets in Response to President Obama’s Re-election.

Illustrates mapping of tweets by geo-locations.

Mapping against other characteristics of geo-locations could be interesting as well.

I first saw this in a tweet by Ed Chi.

Twitter Flies by Hadoop on Search Quest

Friday, November 9th, 2012

Twitter Flies by Hadoop on Search Quest by Ian Armas Foster.

From the post:

People who use Twitter may not give a second thought to the search bar at the top of the page. It’s pretty basic, right? You type something into the nifty little box and, like the marvel of efficient search that it is, it offers suggestions for things the user might wish to search during the typing process.

On the surface, it operates like any decent search engine. Except, of course, this is Twitter we’re talking about. There is no basic functionality at the core here. As it turns out, a significant amount of effort went into designing the Twitter search suggestion engine and the network is still just getting started refining this engine.

A recent Twitter-published scientific paper tells the tale of Twitter’s journey through their previously existing Hadoop infrastructure to a custom combined infrastructure. This connects the HDFS to a frontend cache (to deal with queries and responses) and a backend (which houses algorithms that rank relevance).

The latency of the Hadoop solution was too high.

Makes me think about topic map authoring with a real time “merging” interface. One that displays the results of a current topic, association or occurrence that is being authored on the map.

Or at least the option to choose to see such a display with some reasonable response time.

Intro to Scalding by @posco and @argyris [video lecture]

Sunday, November 4th, 2012

Intro to Scalding by @posco and @argyris by Marti Hearst.

From the post:

On Thursday we learned about an alternative language for analyzing big data: Scalding. It’s built on Scala and is used extensively by the Twitter Revenue group. Oscar Boykin presented a lecture that he and Argyris Zymnis put together for us:

(video – see Marti’s post)

Because scalding is built on the functional programming language Scala, it has an advantage over Pig in that you can have the equivalent of user-defined functions directly in your code. See the lecture notes for more details. Be sure to watch the video to get all the details, especially since Oscar managed to make us all laugh throughout his lecture. Thanks guys!

Another great lecture from Marti’s class, “Analyzing Big Data with Twitter.”

When the revenue department of a business, at least a successful business, starts using a technology, it’s time to take notice.

Saving Tweets

Sunday, November 4th, 2012

No, it is not another social cause to save X but rather Pierre Lindenbaum saving his own tweets in: Saving your tweets in a database using sqlite, rhino, scribe, javascript.

Requires SQLite, Mozilla Rhino, Scribe and Apache Commons Codec.

Mapping the saved tweets comes to mind. I am sure you can imagine other uses in a network of topic maps.
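If you would rather not assemble that toolchain, the storage half can be sketched with Python's built-in sqlite3 module. This is a swapped-in technique, not Pierre's JavaScript, and it leaves out fetching tweets from Twitter entirely. The table layout and function name are assumptions made for illustration.

```python
import sqlite3


def save_tweets(conn, tweets):
    """Store (id, created_at, text) tuples, skipping ids already saved."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "id INTEGER PRIMARY KEY, created_at TEXT, text TEXT)"
    )
    # INSERT OR IGNORE makes repeated runs safe: existing ids are left alone.
    conn.executemany(
        "INSERT OR IGNORE INTO tweets (id, created_at, text) VALUES (?, ?, ?)",
        tweets,
    )
    conn.commit()
```

Pass it a connection from sqlite3.connect() pointing at a file of your choice; running it again with overlapping tweets does not create duplicates.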

Predicting what topics will trend on Twitter [Predicting Merging?]

Friday, November 2nd, 2012

Predicting what topics will trend on Twitter

From the post:

Twitter’s home page features a regularly updated list of topics that are “trending,” meaning that tweets about them have suddenly exploded in volume. A position on the list is highly coveted as a source of free publicity, but the selection of topics is automatic, based on a proprietary algorithm that factors in both the number of tweets and recent increases in that number.

At the Interdisciplinary Workshop on Information and Decision in Social Networks at MIT in November, Associate Professor Devavrat Shah and his student, Stanislav Nikolov, will present a new algorithm that can, with 95 percent accuracy, predict which topics will trend an average of an hour and a half before Twitter’s algorithm puts them on the list — and sometimes as much as four or five hours before.

If you can’t attend the Interdisciplinary Workshop on Information and Decision in Social Networks, which has an exciting final program, try Stanislav Nikolov’s thesis, Trend or No Trend: A Novel Nonparametric Method for Classifying Time Series.

Abstract:

In supervised classification, one attempts to learn a model of how objects map to labels by selecting the best model from some model space. The choice of model space encodes assumptions about the problem. We propose a setting for model specification and selection in supervised learning based on a latent source model. In this setting, we specify the model by a small collection of unknown latent sources and posit that there is a stochastic model relating latent sources and observations. With this setting in mind, we propose a nonparametric classification method that is entirely unaware of the structure of these latent sources. Instead, our method relies on the data as a proxy for the unknown latent sources. We perform classification by computing the conditional class probabilities for an observation based on our stochastic model. This approach has an appealing and natural interpretation — that an observation belongs to a certain class if it sufficiently resembles other examples of that class.

We extend this approach to the problem of online time series classification. In the binary case, we derive an estimator for online signal detection and an associated implementation that is simple, efficient, and scalable. We demonstrate the merit of our approach by applying it to the task of detecting trending topics on Twitter. Using a small sample of Tweets, our method can detect trends before Twitter does 79% of the time, with a mean early advantage of 1.43 hours, while maintaining a 95% true positive rate and a 4% false positive rate. In addition, our method provides the flexibility to perform well under a variety of tradeoffs between types of error and relative detection time.
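
The “sufficiently resembles other examples of that class” idea can be sketched in a few lines of Python. This is a simplified illustration, not the thesis implementation: it assumes all series have the same length, uses squared Euclidean distance, and the names trend_score and is_trending and the parameters gamma and theta are labels chosen here. Each labeled example votes with a weight that decays exponentially with its distance from the observation, and the observation is called trending when the trending votes outweigh the non-trending votes by more than theta.

```python
import math


def distance(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def trend_score(observed, trending_examples, quiet_examples, gamma=1.0):
    """Ratio of similarity-weighted votes: trending versus non-trending examples."""
    pos = sum(math.exp(-gamma * distance(observed, r)) for r in trending_examples)
    neg = sum(math.exp(-gamma * distance(observed, r)) for r in quiet_examples)
    if neg == 0.0:
        # All non-trending votes underflowed to zero (or there were none).
        return math.inf if pos > 0.0 else 0.0
    return pos / neg


def is_trending(observed, trending_examples, quiet_examples, gamma=1.0, theta=1.0):
    """Classify as trending when the vote ratio exceeds the threshold theta."""
    return trend_score(observed, trending_examples, quiet_examples, gamma) > theta


if __name__ == "__main__":
    # Made-up activity counts per time step, for illustration only.
    trending = [[1, 2, 4, 8], [1, 3, 6, 12]]
    quiet = [[2, 2, 2, 2], [3, 2, 3, 2]]
    print(is_trending([1, 2, 5, 9], trending, quiet))
    print(is_trending([2, 2, 3, 2], trending, quiet))
```

Raising theta trades earlier detection for fewer false positives, which is the kind of tradeoff the abstract mentions.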

This will be interesting in many classification contexts.

Particularly predicting what topics a user will say represent the same subject.

Design a Twitter Like Application with Nati Shalom

Thursday, November 1st, 2012

Design a Twitter Like Application with Nati Shalom

From the description:

Design a large scale NoSQL/DataGrid application similar to Twitter with Nati Shalom.

The use case is solved with Gigaspaces and Cassandra but other NoSQL and DataGrids solutions could be used.

Slides : xebia-video.s3-website-eu-west-1.amazonaws.com/2012-02/realtime-analytics-for-big-data-a-twitter-case-study-v2-ipad.pdf

If you enjoyed the posts I pointed to at: Building your own Facebook Realtime Analytics System, you will enjoy the video. (Same author.)

Not to mention Nati teaches patterns, the specific software being important but incidental.

The one million tweet map

Tuesday, October 30th, 2012

The one million tweet map

Displays the last one million tweets by geographic location, plus the top five (5) hashtags.
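
A rough sketch of the bookkeeping such a display implies, in Python. None of this comes from the map’s actual implementation: the TweetWindow class, its method names, and the one-degree grid used as a stand-in for real map clustering are all assumptions. It keeps only the most recent tweets, counts hashtags over that window, and buckets tweets by location.

```python
import math
import re
from collections import Counter, deque

HASHTAG = re.compile(r"#(\w+)")


class TweetWindow:
    """Holds the most recent `size` geotagged tweets (hypothetical helper)."""

    def __init__(self, size=1_000_000):
        # A bounded deque silently drops the oldest tweet once it is full.
        self.tweets = deque(maxlen=size)

    def add(self, text, lat, lon):
        self.tweets.append((text, lat, lon))

    def top_hashtags(self, n=5):
        """Most common hashtags in the window, compared case-insensitively."""
        counts = Counter(
            tag.lower()
            for text, _, _ in self.tweets
            for tag in HASHTAG.findall(text)
        )
        return counts.most_common(n)

    def grid_counts(self, cell_degrees=1.0):
        """Count tweets per grid cell, a crude stand-in for map clustering."""
        return Counter(
            (math.floor(lat / cell_degrees), math.floor(lon / cell_degrees))
            for _, lat, lon in self.tweets
        )


if __name__ == "__main__":
    window = TweetWindow(size=3)
    # Made-up tweets and coordinates.
    window.add("old #gone", 0.0, 0.0)
    window.add("hi #rstats", 48.85, 2.35)
    window.add("more #Rstats #bigdata", 48.86, 2.34)
    window.add("#bigdata again #rstats", 40.7, -74.0)
    print(window.top_hashtags(5))
    print(window.grid_counts())
```

Recounting the whole window on every call is fine for a sketch; at a million tweets you would keep running counts and decrement them as tweets fall out of the window.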

So tweets are not just strings of 140 characters or fewer; they are locations as well. Wondering how far you can take re-purposing of a tweet?

Powered by Maptimize.

I first saw this at Mashable.com.

BTW, I don’t find the Adobe Social ad (part of the video at Mashable) all that convincing.

You?

Information Diffusion on Twitter by @snikolov

Friday, October 26th, 2012

Information Diffusion on Twitter by @snikolov by Marti Hearst.

From the post:

Today Stan Nikolov, who just finished his masters at MIT in studying information diffusion networks, walked us through one particular theoretical model of information diffusion which tries to predict under what conditions an idea stops spreading based on a network’s structure (from the popular Easley and Kleinberg Network book). Stan also gathered a huge amount of Twitter data, processed it using Pig scripts, and graphed the results using Gephi. The video lecture below shows you some great visualizations of the spreading behavior of the data!

(video omitted)

The slides in his Lecture Notes let you see the Pig scripts in more detail.
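
For readers without the book at hand, here is a small Python sketch of a threshold cascade of the kind Easley and Kleinberg describe. That this is the exact model covered in the lecture is my assumption; the graph, the function name run_cascade, and the threshold values are made up for illustration. A node adopts the idea once at least a fraction q of its neighbors have adopted it, and the spread stops when no remaining node meets that threshold.

```python
def run_cascade(neighbors, initial_adopters, q):
    """Spread an idea until no further node adopts; return the final adopters.

    neighbors maps each node to the set of its neighbors (undirected graph).
    A node adopts when the adopting share of its neighbors is at least q.
    """
    adopted = set(initial_adopters)
    changed = True
    while changed:
        changed = False
        for node, nbrs in neighbors.items():
            if node in adopted or not nbrs:
                continue
            share = len(nbrs & adopted) / len(nbrs)
            if share >= q:
                adopted.add(node)
                changed = True
    return adopted


if __name__ == "__main__":
    # A chain 1-2-3 attached to a tightly knit triangle 4-5-6 (made-up graph).
    graph = {
        1: {2},
        2: {1, 3},
        3: {2, 4},
        4: {3, 5, 6},
        5: {4, 6},
        6: {4, 5},
    }
    print(sorted(run_cascade(graph, {1}, q=0.5)))
    print(sorted(run_cascade(graph, {1}, q=0.3)))
```

With q = 0.5 the idea travels down the chain but stalls at the triangle, because node 4 has only one adopting neighbor out of three; lowering q to 0.3 lets it through. That is the book’s point about densely connected clusters acting as barriers to a cascade.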

Another deeply awesome lecture from Marti’s class on Twitter and big data.

Also an example of the level of analysis that a Twitter stream will need to withstand to avoid “imperial entanglements.”

Kurt Thomas on Security at Twitter and Everywhere

Wednesday, October 24th, 2012

Kurt Thomas on Security at Twitter and Everywhere by Marti Hearst.

From the post:

Kurt Thomas is a former Twitter engineer and a current PhD student at UC Berkeley who studies how the criminal underground conspires to make money via unintended uses of computer systems.

Lecture notes.

Focus is on underground economies that depend upon theft of data or compromise of access to data.

Suspect if you started making money over a free service, that would be an “unintended use” as well.

The Data Science Community on Twitter

Wednesday, October 24th, 2012

The Data Science Community on Twitter

From the webpage:

659 Twitter accounts linked to data science, May 2012.

Linkage of Twitter accounts to display followers and following nodes.

That sounds so inadequate (and is).

You need to go see the page, play with it and then come back.

How was that? Impressive yes?

OK, how would that experience be different if you were using a topic map?

More/less information? Other display options?

It is an impressive piece of eye candy but I have a sense it could be so much more.

You?