Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 9, 2013

AAAI – Weblogs and Social Media

Filed under: Artificial Intelligence,Blogs,Social Media,Tweets — Patrick Durusau @ 12:34 pm

Seventh International AAAI Conference on Weblogs and Social Media

Abstracts and papers from the Seventh International AAAI Conference on Weblogs and Social Media.

Much to consider:

Frontmatter: Six (6) entries.

Full Papers: Sixty-nine (69) entries.

Poster Papers: Eighteen (18) entries.

Demonstration Papers: Five (5) entries.

Computational Personality Recognition: Ten (10) entries.

Social Computing for Workforce 2.0: Seven (7) entries.

Social Media Visualization: Four (4) entries.

When the City Meets the Citizen: Nine (9) entries.

Be aware that the links for tutorials and workshops only give you the abstracts describing the tutorials and workshops.

There is the obligatory “blind men and the elephant” paper:

Blind Men and the Elephant: Detecting Evolving Groups in Social News

Abstract:

We propose an automated and unsupervised methodology for a novel summarization of group behavior based on content preference. We show that graph theoretical community evolution (based on similarity of user preference for content) is effective in indexing these dynamics. Combined with text analysis that targets automatically-identified representative content for each community, our method produces a novel multi-layered representation of evolving group behavior. We demonstrate this methodology in the context of political discourse on a social news site with data that spans more than four years and find coexisting political leanings over extended periods and a disruptive external event that led to a significant reorganization of existing patterns. Finally, where there exists no ground truth, we propose a new evaluation approach by using entropy measures as evidence of coherence along the evolution path of these groups. This methodology is valuable to designers and managers of online forums in need of granular analytics of user activity, as well as to researchers in social and political sciences who wish to extend their inquiries to large-scale data available on the web.

It is a great paper but commits a common error when it notes:

Like the parable of the Blind Men and the Elephant, these techniques provide us with disjoint, specific pieces of information.

Yes, the parable is oft told to make a point about partial knowledge, but the careful observer will ask:

How are we different from the blind men trying to determine the nature of an elephant?

Aren’t we also blind men trying to determine the nature of blind men who are examining an elephant?

And so on?

Not that being blind men should keep us from having opinions, but it should make us wary of how deeply we are attached to them.

Not only are there elephants all the way down, there are blind men before us, with us (ourselves included), and around us.

June 29, 2013

Twitter visualizes billions of tweets…

Filed under: Tweets,Visualization — Patrick Durusau @ 2:31 pm

Twitter visualizes billions of tweets in artful, interactive 3D maps by Nathan Olivarez-Giles.

From the post:

On June 1st, Twitter created beautiful maps visualizing billions of geotagged tweets. Today, the social network is getting artsy once again, using the same dataset — which it calls Billion Strokes — to produce interactive elevation maps that render geotagged tweets in 3D. This time around, Twitter visualized geotagged tweets from San Francisco, New York, and Istanbul in maps that viewers can manipulate.

For each city map, Twitter gives users the option of adding eight different layers over the topography. Users can also change the size of the elevation differences mapped out, to get a better idea of where most tweets are sent from. The maps can be seen from either an overhead view, or on a horizontal plane. The resulting maps look like harsh mountain ranges, but the peaks and valleys aren’t representative of the land — rather, a peak illustrates a high volume of tweets being sent from that location, while a trough displays an area where fewer tweets are sent. The whole thing was put together by Nicolas Belmonte, Twitter’s in-house data visualization scientist. You can check out the interactive maps on Twitter’s GitHub page.

I thought the contour view was the most interesting.

A visualization that shows tweet frequency by business address would be interesting as well.

Are more tweets sent from movie theaters or churches?

June 23, 2013

Mapping Twitter demographics

Filed under: Graphics,Tweets,Visualization — Patrick Durusau @ 2:04 pm

Mapping Twitter demographics by Nathan Yau.

languages of twitter

Nathan has uncovered an interactive map of over 3 billion tweets by MapBox, along with Gnip and Eric Fischer.

See Nathan’s post for details.

June 18, 2013

Twitter Analytics Platform Gives Data Back to Users

Filed under: Marketing,Tweets — Patrick Durusau @ 1:03 pm

Twitter Analytics Platform Gives Data Back to Users

From the post:

Previously reserved for advertising partners, Twitter Analytics now shows all users an overview of their timeline activity, reveals more detailed information about their followers and lets them download it all as a CSV.

Presented in a month-long timeline of activity, Twitter Analytics visualizes mentions, follows and something previously much harder to track: unfollows. This additional context alone makes Twitter Analytics a useful tool for any user.

The tool also lists out a complete record of your tweets with a few helpful columns added on: Favorites, Retweets and Replies. As you scroll, the timeline becomes fixed to the top of the browser and you can see the relationship between the content of a tweet and the response it got (if it happened in the last 30 days).

Embedded within the tweets column are some additional metrics detailing the number of clicks on links and some callouts highlighting extended reach of individual tweets. Unsurprisingly, this feed-oriented analytics interface reminds me of the ideas in Anil Dash’s Dashboards Should be Feeds. It certainly works well here.

(…)

If the government is going to have your data, then so should you! 😉

June 9, 2013

Making Sense of Patterns in the Twitterverse

Filed under: Pattern Matching,Pattern Recognition,Tweets — Patrick Durusau @ 4:45 pm

Making Sense of Patterns in the Twitterverse

From the post:

If you think keeping up with what’s happening via Twitter, Facebook and other social media is like drinking from a fire hose, multiply that by 7 billion — and you’ll have a sense of what Court Corley wakes up to every morning.

Corley, a data scientist at the Department of Energy’s Pacific Northwest National Laboratory, has created a powerful digital system capable of analyzing billions of tweets and other social media messages in just seconds, in an effort to discover patterns and make sense of all the information. His social media analysis tool, dubbed “SALSA” (SociAL Sensor Analytics), combined with extensive know-how — and a fair degree of chutzpah — allows someone like Corley to try to grasp it all.

“The world is equipped with human sensors — more than 7 billion and counting. It’s by far the most extensive sensor network on the planet. What can we learn by paying attention?” Corley said.

Among the payoffs Corley envisions are emergency responders who receive crucial early information about natural disasters such as tornadoes; a tool that public health advocates can use to better protect people’s health; and information about social unrest that could help nations protect their citizens. But finding those jewels amidst the effluent of digital minutia is a challenge.

“The task we all face is separating out the trivia, the useless information we all are blasted with every day, from the really good stuff that helps us live better lives. There’s a lot of noise, but there’s some very valuable information too.”

The work by Corley and colleagues Chase Dowling, Stuart Rose and Taylor McKenzie was named best paper at the IEEE Conference on Intelligence and Security Informatics in Seattle this week.

Another one of those “name” issues as the IEEE conference site reports:

Courtney Corley, Chase Dowling, Stuart Rose and Taylor McKenzie. SociAL Sensor Analytics: Measuring Phenomenology at Scale.

Since the other researchers match, I am assuming this is the “Court/Courtney” in question.

I was unable to find an online copy of the paper but suspect it will eventually appear in an IEEE archive.

From the news report, very interesting and useful work.

June 1, 2013

SIGKDD Explorations December 2012

Filed under: BigData,Data Mining,Graphs,Knowledge Discovery,Tweets — Patrick Durusau @ 9:29 am

SIGKDD Explorations December 2012

The hard copy of SIGKDD Explorations arrived in the last week.

Comments to follow on several of the articles, but if you are not a regular SIGKDD Explorations reader, this issue may convince you to become one.

Quick peek:

  • War stories from Twitter (Would you believe semantic issues persist in modern IT organizations?)
  • Analyzing heterogeneous networks (Heterogeneity, everybody talks about it….)
  • “Big Graph” (Will “Big Graph” replace “Big Data?”)
  • Mining large data streams (Will “Big Streams” replace “Big Graph?”)

Along with the current state of Big Data mining, its future and other goodies.

Posts will follow on some of the articles but I wanted to give you a heads-up.

The hard copy?

I read it while our chickens are in the yard.

Local ordinance prohibits unleashed chickens on the street so I have to keep them in the yard.

May 16, 2013

Automated Archival and Visual Analysis of Tweets…

Filed under: Searching,Tweets — Patrick Durusau @ 7:24 pm

Automated Archival and Visual Analysis of Tweets Mentioning #bog13, Bioinformatics, #rstats, and Others by Stephen Turner.

From the post:

Ever since Twitter gamed its own API and killed off great services like IFTTT triggers, I’ve been looking for a way to automatically archive tweets containing certain search terms of interest to me. Twitter’s built-in search is limited, and I wanted to archive interesting tweets for future reference and to start playing around with some basic text / trend analysis.

Enter t – the Twitter command-line interface. t is a power tool for doing all sorts of Twitter queries from the command line. See t’s documentation for examples.

I wrote this script that uses the t utility to search Twitter separately for a set of specified keywords, and append those results to a file. The comments at the end of the script also show you how to commit changes to a git repository, push to GitHub, and automate the entire process to run twice a day with a cron job. Here’s the code as of May 14, 2013:

Stephen notes in his post that the archive updates automatically, so you may find “unsavory” tweets in it.

I didn’t but that may be a matter of happenstance or sensitivity. 😉
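If you want the flavor of the approach without chasing down the post, here is a minimal sketch of the same idea in Python. This is not Stephen's actual script (which is in his post); the keywords, the result count, and the file layout are my own choices:

```python
import subprocess
from datetime import datetime, timezone

# Hypothetical sketch: for each keyword, run the `t` CLI's search
# and append the results to a per-keyword archive file. A cron job
# can run this twice a day and commit the files to git, as Stephen
# describes.
KEYWORDS = ["#bog13", "bioinformatics", "#rstats"]

for kw in KEYWORDS:
    result = subprocess.run(
        ["t", "search", "all", "-n", "200", kw],
        capture_output=True, text=True, check=True)
    with open(kw.lstrip("#") + ".txt", "a", encoding="utf-8") as out:
        out.write("# fetched " + datetime.now(timezone.utc).isoformat() + "\n")
        out.write(result.stdout)
```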

May 13, 2013

Analyzing Twitter: An End-to-End Data Pipeline Recap

Filed under: BigData,Cloudera,Mahout,Tweets — Patrick Durusau @ 10:32 am

Analyzing Twitter: An End-to-End Data Pipeline Recap by Jason Barbour.

Jason reviews presentations at a recent Data Science MD meeting:

Starting off the night, Joey Echeverria, a Principal Solutions Architect, first discussed a big data architecture and how key components of a relational data management system can be replaced with current big data technologies. With Twitter being increasingly popular with marketing teams, analyzing Twitter data becomes a perfect use case to demonstrate a complete big data pipeline.

(…)

Following Joey, Sean Busbey, a Solutions Architect at Cloudera, discussed working with Mahout, a scalable machine learning library for Hadoop. Sean first introduced the three C’s of machine learning: classification, clustering, and collaborative filtering. With classification, learning from a training set is supervised, and new examples can be categorized. Clustering allows examples to be grouped together with common features, while collaborative filtering allows new candidates to be suggested.

Great summaries, links to additional resources and the complete slides.

Check the DC Data Community Events Calendar if you plan to visit the DC area. (I assume residents already do.)

May 12, 2013

Finding Significant Phrases in Tweets with NLTK

Filed under: Natural Language Processing,NLTK,Tweets — Patrick Durusau @ 3:17 pm

Finding Significant Phrases in Tweets with NLTK by Sujit Pal.

From the post:

Earlier this week, there was a question about finding significant phrases in text on the Natural Language Processing People (login required) group on LinkedIn. I suggested looking at this LingPipe tutorial. The idea is to find statistically significant word collocations, i.e., those that occur more frequently than we can explain away as due to chance. I first became aware of this approach from the LLG Book, where two approaches are described – one based on Log-Likelihood Ratios (LLR) and one based on the Chi-Squared test of independence – the latter is used by LingPipe.

I had originally set out to actually provide an implementation for my suggestion (to answer a followup question). However, the Scipy Pydoc notes that the chi-squared test may be invalid when the number of observed or expected frequencies in each category are too small. Our algorithm compares just two observed and expected frequencies, so it probably qualifies. Hence I went with the LLR approach, even though it is slightly more involved.

The idea is to find, for each bigram pair, the likelihood that the components are dependent on each other versus the likelihood that they are not. For bigrams which have a positive LLR, we repeat the analysis by adding its neighbor word, and arrive at a list of trigrams with positive LLR, and so on, until we reach the N-gram level we think makes sense for the corpus. You can find an explanation of the math in one of my earlier posts, but you will probably find a better explanation in the LLG book.

For input data, I decided to use Twitter. I’m not that familiar with the Twitter API, but I’m taking the Introduction to Data Science course on Coursera, and the first assignment provided some code to pull data from the Twitter 1% feed, so I just reused that. I preprocess the feed so I am left with about 65k English tweets using the following code:

An interesting look “behind the glass” at n-grams.

I am using AntConc to generate n-grams for proofing standards prose.

But as a finished tool, AntConc doesn’t give you insight into the technical side of the process.
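For a peek at that technical side, here is a minimal sketch of the LLR collocation idea using NLTK's built-in finder. This is my illustration, not Sujit's code; the toy corpus stands in for his 65k preprocessed tweets:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy corpus standing in for the preprocessed English tweets.
tweets = ["big data is the new black",
          "big data needs big tools",
          "significant phrases in big data"]
# Note: concatenating tweets creates a few spurious cross-tweet
# bigrams; fine for a sketch, worth handling properly in real use.
words = [w for t in tweets for w in t.split()]

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)  # ignore very rare pairs

# Rank bigrams by log-likelihood ratio, the LLG-book approach.
top = finder.nbest(BigramAssocMeasures.likelihood_ratio, 5)
print(top)  # ('big', 'data') should rank near the top
```

The same finder classes exist for trigrams, which parallels Sujit's step of extending positive-LLR bigrams with neighboring words.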

April 24, 2013

Welcome to TweetMap ALPHA

Filed under: GPU,Maps,SQL,Tweets — Patrick Durusau @ 3:57 pm

Welcome to TweetMap ALPHA

From the introduction popup:

TweetMap is an instance of MapD, a massively parallel database platform being developed through a collaboration between Todd Mostak (currently a researcher at MIT) and the Harvard Center for Geographic Analysis (CGA).

The tweet database presented here starts on 12/10/2012 and ends 12/31/2012. Currently 95 million tweets are available to be queried by time, space, and keyword. This could increase to billions and we are working on real time streaming from tweet-tweeted to tweet-on-the-map in under a second.

MapD is a general purpose SQL database that can be used to provide real-time visualization and analysis of just about any very large data set. MapD makes use of commodity Graphic Processing Units (GPUs) to parallelize hard compute jobs such as that of querying and rendering very large data sets on-the-fly.

This is a real treat!

Try something popular, like “gaga,” without the quotes.

Remember this is running against 95 million tweets.

Impressive! Yes?

April 23, 2013

Meet @InfoVis_Ebooks, …

Filed under: Tweets,Visualization — Patrick Durusau @ 7:15 pm

Meet @InfoVis_Ebooks, Your Source for Random InfoVis Paper Snippets by Robert Kosara.

From the post:

InfoVis Ebooks takes a random piece of text from a random paper in its repository and tweets it. It has read all of last year’s InfoVis papers, and is now getting started with the VAST proceedings. After that, it will start reading infovis papers published in last year’s EuroVis and CHI conferences, and then work its way back to previous years.

Each tweet contains a reference to the paper the snippet is from. For InfoVis, VAST, and CHI, these are DOIs rather than links. Links get long and distracting, whereas DOIs are much easier to tune out in a tweet. If you want to see the paper, google the DOI string (keep the “doi:” part). You can also take everything but the “doi:” and append it to http://dx.doi.org/ to be redirected to the paper page. For other sources, I will probably have to use links.

As the name suggests, InfoVis Ebooks is about infovis papers. If you want to do the same for SciVis, HCI, or anything else, the code is available on github.

When I first saw this, I thought it would be a source of spam.

But it lingered on a browser tab for a day or so and when I looked back at it, I started to get interested.

Not that this would help a machine, but for human readers, seeing the right snippet at the right time could lead to a good (or bad) idea.

You can’t tell which one in advance, but it seems like it would be worth the risk.

Perhaps we can’t guarantee serendipity but we can create conditions where it is more likely to happen.

Yes?

PS: If you start one of these feeds, let me know so I can point to it.
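PPS: If you would rather resolve the DOI strings programmatically than google them, the append-to-dx.doi.org step Robert describes is a one-liner (the example DOI below is hypothetical):

```python
def doi_to_url(snippet_ref):
    """Turn a tweet's 'doi:10.xxxx/yyyy' reference into a resolvable link."""
    assert snippet_ref.startswith("doi:")
    return "http://dx.doi.org/" + snippet_ref[len("doi:"):]

print(doi_to_url("doi:10.1109/TVCG.2012.999"))  # hypothetical DOI
```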

April 14, 2013

A walk-through for the Twitter streaming API

Filed under: Scala,Tweets — Patrick Durusau @ 2:42 pm

A walk-through for the Twitter streaming API by Jason Baldridge.

From the post:

Analyzing tweets is all the rage, and if you are new to the game you want to know how to get them programmatically. There are many ways to do this, but a great start is to use the Twitter streaming API, a RESTful service that allows you to pull tweets in real time based on criteria you specify. For most people, this will mean having access to the spritzer, which provides only a very small percentage of all the tweets going through Twitter at any given moment. For access to more, you need to have a special relationship with Twitter or pay Twitter or an affiliate like Gnip.

This post provides a basic walk-through for using the Twitter streaming API. You can get all of this based on the documentation provided by Twitter, but this will be slightly easier going for those new to such services. (This post is mainly geared for the first phase of the course project for students in my Applied Natural Language Processing class this semester.)

You need to have a Twitter account to do this walk-through, so obtain one now if you don’t have one already.

Basics of obtaining tweets from the Twitter stream.
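Jason's walk-through targets Scala, but the first step looks much the same in any language. A minimal Python sketch against the spritzer endpoint as it existed when this was written (you supply the four OAuth credentials from your Twitter developer account; the API has changed since):

```python
import json

import requests
from requests_oauthlib import OAuth1

# Credentials are placeholders; fill in your own.
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",
              "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# statuses/sample is the "spritzer": a small slice of the full stream.
url = "https://stream.twitter.com/1.1/statuses/sample.json"
with requests.get(url, auth=auth, stream=True) as resp:
    for line in resp.iter_lines():
        if line:  # keep-alive newlines arrive as empty lines
            tweet = json.loads(line)
            if "text" in tweet:  # skip delete notices and other events
                print(tweet["text"])
```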

I mention it as an active data source that may find its way into your topic map.

April 9, 2013

Improving Twitter search with real-time human computation [“semantics supplied”]

Filed under: Human Computation,Search Engines,Searching,Semantics,Tweets — Patrick Durusau @ 1:54 pm

Improving Twitter search with real-time human computation by Edwin Chen.

From the post:

Before we delve into the details, here’s an overview of how the system works.

(1) First, we monitor for which search queries are currently popular.

Behind the scenes: we run a Storm topology that tracks statistics on search queries.

For example: the query “Big Bird” may be averaging zero searches a day, but at 6pm on October 3, we suddenly see a spike in searches from the US.

(2) Next, as soon as we discover a new popular search query, we send it to our human evaluation systems, where judges are asked a variety of questions about the query.

Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to Amazon’s Mechanical Turk service, and then polls Mechanical Turk for a response.

For example: as soon as we notice “Big Bird” spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant tweets and ads.

Finally, after a response from a judge is received, we push the information to our backend systems, so that the next time a user searches for a query, our machine learning models will make use of the additional information. For example, suppose our human judges tell us that “Big Bird” is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.

Let’s now explore the first two sections above in more detail.

….
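The spike-detection step (1) is easy to approximate without Storm. A toy sketch, with the window, factor, and thresholds invented for illustration:

```python
from collections import deque

class SpikeDetector:
    """Flag a query as 'popular' when its hourly count jumps well
    above its own recent average (a toy stand-in for the Storm
    topology described above; all thresholds are invented)."""

    def __init__(self, window=24, factor=5.0, min_count=50):
        self.window = window        # hours of history per query
        self.factor = factor        # how far above baseline counts as a spike
        self.min_count = min_count  # ignore low-volume noise
        self.history = {}

    def observe(self, query, count):
        hist = self.history.setdefault(query, deque(maxlen=self.window))
        baseline = sum(hist) / len(hist) if hist else 0.0
        hist.append(count)
        # Spike: enough absolute volume and far above the recent baseline.
        return count >= self.min_count and count > self.factor * max(baseline, 1.0)

detector = SpikeDetector()
if detector.observe("big bird", 900):
    print("send 'big bird' to the human judges")
```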

The post is quite awesome and I suggest you read it in full.

This resonates with a recent comment about Lotus Agenda.

The short version is that a user creates a thesaurus in Agenda, and that thesaurus enriches subsequent searches. The user supplies semantics to enhance the searches.

In the Twitter case, human reviewers supply semantics to enhance the searches.

In both cases, Agenda and Twitter, humans are supplying semantics to enhance the searches.

I emphasize “supplying semantics” as a contrast to mechanistic searches that rely on text.

Mechanistic searches can be quite valuable but they pale beside searches where semantics have been “supplied.”

The Twitter experience is an important clue.

The answer to semantics for searches lies somewhere between asking an expert (you get his/her semantics) and asking all of us (too many answers to be useful).

More to follow.

March 26, 2013

Analyzing Twitter Data with Apache Hadoop, Part 3:…

Filed under: Hadoop,Hive,Tweets — Patrick Durusau @ 12:52 pm

Analyzing Twitter Data with Apache Hadoop, Part 3: Querying Semi-structured Data with Apache Hive by Jon Natkins.

From the post:

This is the third article in a series about analyzing Twitter data using some of the components of the Apache Hadoop ecosystem that are available in CDH (Cloudera’s open-source distribution of Apache Hadoop and related projects). If you’re looking for an introduction to the application and a high-level view, check out the first article in the series.

In the previous article in this series, we saw how Flume can be utilized to ingest data into Hadoop. However, that data is useless without some way to analyze the data. Personally, I come from the relational world, and SQL is a language that I speak fluently. Apache Hive provides an interface that allows users to easily access data in Hadoop via SQL. Hive compiles SQL statements into MapReduce jobs, and then executes them across a Hadoop cluster.

In this article, we’ll learn more about Hive, its strengths and weaknesses, and why Hive is the right choice for analyzing tweets in this application.

I didn’t realize I had missed this part of the Hive series until I saw it mentioned in the Hue post.

Good introduction to Hive.
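For a taste of what the series is driving at, here is a hypothetical query in its spirit. The table and column names are my guesses, not Jon's; Hive compiles the SQL into MapReduce jobs behind the scenes:

```python
import subprocess

# Hypothetical: who gets retweeted most, out of a 'tweets' table
# built from raw JSON like the one in the series?
query = """
SELECT retweeted_status.user.screen_name AS retweeted_user,
       COUNT(*) AS times_retweeted
FROM tweets
WHERE retweeted_status.user.screen_name IS NOT NULL
GROUP BY retweeted_status.user.screen_name
ORDER BY times_retweeted DESC
LIMIT 10;
"""

# `hive -e` runs a query string from the command line.
subprocess.run(["hive", "-e", query], check=True)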

BTW, is Twitter data becoming the “hello world” of data mining?

How-to: Analyze Twitter Data with Hue

Filed under: Hadoop,Hive,Hue,Tweets — Patrick Durusau @ 12:46 pm

How-to: Analyze Twitter Data with Hue by Romain Rigaux.

From the post:

Hue 2.2, the open source web-based interface that makes Apache Hadoop easier to use, lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features different applications like an Apache Hive editor and Apache Oozie dashboard and workflow builder.

This post is based on our “Analyzing Twitter Data with Hadoop” sample app and details how the same results can be achieved through Hue in a simpler way. Moreover, all the code and examples of the previous series have been updated to the recent CDH4.2 release.

The Hadoop ecosystem continues to improve!

Question: Is anyone keeping a current listing/map of the various components in the Hadoop ecosystem?

March 25, 2013

HOWTO use Hive to SQLize your own Tweets…

Filed under: Hive,SQL,Tweets — Patrick Durusau @ 2:59 am

HOWTO use Hive to SQLize your own Tweets – Part One: ETL and Schema Discovery by Russell Jurney.

HOWTO use Hive to SQLize your own Tweets – Part Two: Loading Hive, SQL Queries

Russell walks you through extracting your tweets, discovering their schema, loading them into Hive and querying the result.

I just requested my tweets on Friday so I expect to see them tomorrow or Tuesday.

Will be a bit more complicated than Russell’s example because I re-post tweets about older posts on my blog.

I will have to delete those duplicates, although I may want to know when a particular tweet appeared, which means capturing the date(s) on which each tweet appeared.
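Something like this minimal sketch would do the de-duplication while keeping the dates (I am assuming the archive's CSV has text and timestamp columns; check your own download):

```python
import csv
from collections import defaultdict

# Group archived tweets by their text; repeated announcements of
# older posts collapse to one entry that remembers every date it ran.
dates_by_text = defaultdict(list)
with open("tweets.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        dates_by_text[row["text"]].append(row["timestamp"])

unique_tweets = {text: sorted(dates) for text, dates in dates_by_text.items()}
repeats = {text: dates for text, dates in unique_tweets.items() if len(dates) > 1}
```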

BTW, if you do obtain your tweet archive, consider donating it to #Tweets4Science.

March 14, 2013

#Tweets4Science

Filed under: Data,Government,Tweets — Patrick Durusau @ 9:35 am

#Tweets4Science

From the manifesto:

User generated content has experienced an explosive growth both in the diversity of available services and the volume of topics covered by the users. Content published in micro-blogging sites such as Twitter is a rich, heterogeneous, and, above all, huge sample of the daily musings of our fellow citizens across the world.

Once qualified as inane chatter, more and more researchers are turning to Twitter data to better understand our social behavior and, no doubt, that global chatter will provide a first person account of our times to future historians.

Thus, initiatives such as the one led by the Library of the US Congress to collect the entire Twitter Archive are laudable. However, as of today, no researcher has been granted access to that archive, there is no estimation of when such access would be possible and, on top of that, access would only be granted on site.

As researchers we understand the legal compromises one must reach with the private sector, and we understand that it is fair that Twitter and resellers offer access to Twitter data, including historical data, for a fee (a rather large one, by the way). However, without the data provided by each of Twitter’s users such a business would be impossible and, hence, we believe that such data belongs to the users individually and as a group.

Includes links on how to download and donate your tweets.

The researchers appeal to altruism: aggregating your tweets with others may advance human knowledge.

I have a much more pragmatic reason:

While I trust the Library of Congress, I don’t trust their paymasters.

Not to sound paranoid, but the delay in anyone accessing the Twitter data at the Library of Congress seems odd. The astronomy community has been providing access to much larger data sets since long before the first tweet.

So why is it taking so long?

While we are waiting on multiple versions of that story, download your tweets and donate them to this project.

March 10, 2013

Programming Isn’t Math

Filed under: Algebird,Scala,Scalding,Tweets — Patrick Durusau @ 8:41 pm

Programming Isn’t Math by Oscar Boykin.

From the description:

Functional programming has a rich history of drawing from mathematical theory, yet in this highly entertaining talk from the Northeast Scala Symposium, Twitter data scientist Oscar Boykin makes the case that programming is distinct from mathematics. This distinction is healthy and does not mean we can’t leverage many results and concepts from mathematics.

As examples, Oscar will discuss some recent work — algebird, bijection, scalding — and show cases where mathematical purity was both helpful and harmful to developing products at Twitter.

The phrase “…highly entertaining…” may be an understatement.

It’s the type of presentation where you want to start reading new material during the talk but are afraid of missing the next gold nugget!

Definitely one to start the week on!

March 6, 2013

ViralSearch: How Viral Content Spreads over Twitter

Filed under: Graphics,Social Media,Tweets,Visualization — Patrick Durusau @ 11:20 am

ViralSearch: How Viral Content Spreads over Twitter by Andrew Vande Moere.

From the post:

ViralSearch [microsoft.com], developed by Jake Hofman and others of Microsoft Research, visualizes how content spreads over social media, and Twitter in particular.

ViralSearch is based on hundreds of thousands of stories that are spread through billions of mentions of these stories, over many generations. In particular, it reveals the typical, hidden structures behind the sharing of viral videos, photos and posts as a hierarchical generation tree or as an animated bubble graph. The interface contains an interactive timeline of events, as well as a search field to explore specific phrases, stories, or Twitter users to provide an overview of how the independent actions of many individuals make content go viral.

As this tool seems only to be available within Microsoft, you can only enjoy it by watching the documentary video below.

See also NYTLabs Cascade: How Information Propagates through Social Media for a visualization of a very similar concept.

Impressive graphics!

Question: If and when you have an insight while viewing a social networking graphic, where do you capture that insight?

That is, how do you link your insight to a particular point in the graphic?

February 17, 2013

Download all your tweets [Are You An Outlier/Drone Target?]

Filed under: Data,Tweets — Patrick Durusau @ 8:17 pm

Download all your tweets by Ajay Ohri.

From the post:

Now that the Government of the United States of America has the legal power to request your information without a warrant (The Chinese love this!)

Anyways- you can also download your own twitter data. Liberate your data.

Have you looked at your own data? Go there at https://twitter.com/settings/account and review the changes.

Modern governments invent evidence out of whole cloth, enough to topple other governments, so whether my communications are secure or not may be a moot point.

It may make a difference, though, whether your communications stand out enough that they focus on inventing evidence about you.

In that case, having all your tweets, particularly with the tweets of others, could be a useful thing.

With enough data a profile could be constructed so that your tweets come within plus or minus some percentage of the normal tweets for your demographic.

I don’t ever tweet about American Idol (#idol) so I am already an outlier. 😉

Mapping the demographics to content and hash tags, along with dates, events, etc. would make for a nice graph/topic map type application.

Perhaps even a deviation warning system, for when your tweets start to curve away from the pack.
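A toy sketch of such a warning system, with the profiles and threshold invented for illustration:

```python
from collections import Counter

def hashtag_shares(tweets):
    """Fraction of my hashtag usage going to each tag."""
    counts = Counter(tag for t in tweets for tag in t["hashtags"])
    total = sum(counts.values()) or 1
    return {tag: n / total for tag, n in counts.items()}

def deviation(mine, cohort):
    """Total-variation-style distance between my hashtag profile and
    a demographic baseline (both profiles invented for illustration)."""
    tags = set(mine) | set(cohort)
    return sum(abs(mine.get(t, 0.0) - cohort.get(t, 0.0)) for t in tags) / 2

my_tweets = [{"hashtags": ["bigdata"]},
             {"hashtags": ["topicmaps", "bigdata"]}]
cohort = {"idol": 0.4, "nfl": 0.3, "bigdata": 0.3}  # invented baseline

if deviation(hashtag_shares(my_tweets), cohort) > 0.5:
    print("warning: curving away from the pack")
```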

Hiding from data mining isn’t an option.

The question is how to hide in plain sight?

January 19, 2013

Building the Library of Twitter

Filed under: Intelligence,Security,Tweets — Patrick Durusau @ 7:08 pm

Building the Library of Twitter by Ian Armas Foster.

From the post:

On an average day people around the globe contribute 500 million messages to Twitter. Collecting and storing every single tweet and its resulting metadata from a single day would be a daunting task in and of itself.

The Library of Congress is trying something slightly more ambitious than that: storing and indexing every tweet ever posted.

With the help of social media facilitator Gnip, the Library of Congress aims to create an archive where researchers can access any tweet recorded since Twitter’s inception in 2006.

According to this update on the progress of the seemingly herculean project, the LOC has already archived 170 billion tweets and their respective metadata. That total includes the posts from 2006-2010, which Gnip compressed and sent to the LOC in three files of 2.3 terabytes each. When the LOC uncompressed the files, they filled 20 terabytes’ worth of server space, representing 21 billion tweets and their supplementary 50 metadata fields.

It is often said that 90% of the world’s data has accrued over the last two years. That is remarkably close to the truth for Twitter, as an additional 150 billion tweets (88% of the total) poured into the LOC archive in 2011 and 2012. Further, Gnip delivers hourly updates to the tune of half a billion tweets a day. That means 42 days’ worth of 2012-2013 tweets equal the total amount from 2006-2010. In all, they are dealing with 133.2 terabytes of information.

Now there’s a big data problem for you! Not to mention a resource problem for the Library of Congress.

You might want to make a contribution to help fund their work on this project.

While the full archive is obviously of incredible value for researchers at all levels, smaller sub-sets of the Twitter stream may be valuable as well.

If I were designing a Twitter-based lexicon for covert communication, for example, I would want to use frequent terms from particular geographic locations.

And/or create patterns of tweets from particular accounts so that they don’t stand out from others.

Not to mention trying to crunch the Twitter stream for content I know must be present.

December 26, 2012

Want some hackathon friendly altmetrics data?…

Filed under: Citation Analysis,Graphs,Networks,Tweets — Patrick Durusau @ 7:30 pm

Want some hackathon friendly altmetrics data? arXiv tweets dataset now up on figshare by Euan Adie.

From the post:

The dataset contains details of approximately 57k tweets linking to arXiv papers, found between 1st January and 1st October this year. You’ll need to supplement it with data from the arXiv API if you need metadata about the preprints linked to. The dataset does contain follower counts and lat/lng pairs for users where possible, which could be interesting to plot.

Euan has some suggested research directions and more details on the data set.
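If the plotting idea tempts you, a minimal sketch (the CSV column names are my guess; check the figshare dataset for the real ones):

```python
import csv

import matplotlib.pyplot as plt

lats, lngs = [], []
with open("arxiv_tweets.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if row.get("lat") and row.get("lng"):  # only geocoded users
            lats.append(float(row["lat"]))
            lngs.append(float(row["lng"]))

plt.scatter(lngs, lats, s=2, alpha=0.3)
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.title("Users tweeting links to arXiv papers")
plt.show()
```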

Something to play with during the holiday “down time.” 😉

I first saw this in a tweet by Jason Priem.

December 14, 2012

Analyzing Big Data With Twitter

Filed under: BigData,CS Lectures,Tweets — Patrick Durusau @ 10:42 am

UC Berkeley Course Lectures: Analyzing Big Data With Twitter by Marti Hearst.

Marti gives a summary of this excellent class, with links to videos, slides and high-level notes for the course.

If you enjoyed these materials, make a post about them, recommend them to others or even send Marti a note of appreciation.

Prof. Marti Hearst, hearst@ischool.berkeley.edu

November 26, 2012

Analyzing the Twitter Conversation and Interest Graphs

Filed under: BigData,Graphs,Tweets,Visualization — Patrick Durusau @ 5:42 am

Analyzing the Twitter Conversation and Interest Graphs by Marti Hearst.

From the post:

For assignment 3, students analyzed and compared a portion of the Twitter “conversation graph” and the “interest graph”. Conversations were found by looking for Twitter “@mentions”, and the interest graph by looking at the friend/follow graphs for a user (finding friends of friends, taking a k-core analysis, and closing the triangles). The attached document highlights much of the students’ work.

One of the most impressive graphs was made by Achal Soni. He used Java and the Twitter4J library to obtain 3000 tweets for 4 rappers (Drake, Kendrick Lamar, J Cole, and Big Sean). He extracted @mentions from these tweets, and created a graph whose edges connect the celebrities to whomever they were conversing with.

A clever choice of colors makes this network representation work very well.
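For the curious, the mention-extraction and k-core steps are easy to try at home. A minimal sketch with networkx (the tweets are invented):

```python
import re

import networkx as nx

# 1. Conversation graph: extract @mentions from tweet text.
tweets = [("drake", "@kendricklamar that verse though"),
          ("kendricklamar", "@drake @jcole studio tonight"),
          ("jcole", "@bigsean @drake new mix")]
G = nx.Graph()
for author, text in tweets:
    for mention in re.findall(r"@(\w+)", text):
        G.add_edge(author, mention.lower())

# 2. Interest-graph trimming: keep the k-core, i.e. iteratively drop
# users with fewer than k ties, then see which triangles survive.
core = nx.k_core(G, k=2)
print(core.edges())  # the drake/kendricklamar/jcole triangle remains
```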

November 14, 2012

Spark: Making Big Data Analytics Interactive and Real-­Time (Video Lecture)

Filed under: CS Lectures,Spark,Tweets — Patrick Durusau @ 5:41 am

Spark: Making Big Data Analytics Interactive and Real-­Time by Matei Zaharia. Post from Marti Hearst.

From the post:

Spark is the hot next thing for Hadoop / MapReduce, and yesterday Matei Zaharia, a PhD student in UC Berkeley’s AMP Lab, gave us a terrific lecture about how it works and what’s coming next. The key idea is to make analysis of big data interactive and able to respond in real time. Matei also gave a live demo.

(slides here)

Spark: Lightning-Fast Cluster Computing (website).

Another great lecture from Marti’s class on Twitter and Big Data.

November 13, 2012

Managing Conference Hashtags

Filed under: Conferences,Semantic Web,Tagging,Tweets — Patrick Durusau @ 5:13 pm

David Karger tweets today:

Ironically amusing that ontology researchers can’t manage to agree on a canonical tag for their conference #iswc #iswc12 #iswc2012

If that’s true for ontology researchers, what chance does the rest of the world have?

Just to help ontology researchers along a bit (in LTM syntax):

*****

/* typing topics */

[conf = "conference"]

/* scoping topics */

[SWTwitter01 : conf = "Semantic Web, Twitter hashtag 01."]

[SWTwitter02 : conf = "Semantic Web, Twitter hashtag 02."]

[SWTwitter03 : conf = "Semantic Web, Twitter hashtag 03."]

[iswc2012 : conf = "ISWC 2012, The 11th International Semantic Web Conference"
("#iswc" / SWTwitter01)
("#iswc12" / SWTwitter02)
("#iswc2012" / SWTwitter03)]

*****

I added the “conf” typing topic to the scoping topics to distinguish those tags from others used for:

ISWC (International Standard Musical Work Code)

Welcome to ISWC 2013! The International Symposium on Wearable Computers (ISWC)

Wikipedia – ISWC, also lists:

International Speed Windsurfing Class

But missed:

International Student Welcome Committee

There remains the task of distinguishing tags in the wild from tags for these other subjects.

Once that is done, all the tweets about the conference, under these or other tags, can be collocated into a single full set.

Other subjects and relationships, such as person, date, location, topic, tags, retweets, etc., can be just as easily added.


Personally I would make the default sort order for tweets a function of date/time, quite possibly misusing sortname for that purpose. People are accustomed to seeing tweets in time order, and fancy collocation can wait until they select an author, subject, tag, etc.

Mapping Racist Tweets

Filed under: Mapping,Maps,Tweets — Patrick Durusau @ 2:54 pm

Where America’s Racist Tweets Come From by Megan Garber.

WARNING: The cited article has very racist and offensive tweets. They are reproduced to illustrate the technique, not to promote racism.

Megan reports on the work of Floating Sheep, geography academics.

The surprising thing about the geographic distribution (it’s pretty much all over the nation) is the lack of racist tweets from Montana. Where all the survivalist types have bunkered up.

Then I remembered, they don’t have Internet access in log and dirt bunkers. Probably no electricity or running water either. Some politics are their own reward. 😉

You may also appreciate the longer original post at Floating Sheep: Mapping Racist Tweets in Response to President Obama’s Re-election.

Illustrates mapping of tweets by geo-locations.

Mapping against other characteristics of geo-locations could be interesting as well.

I first saw this in a tweet by Ed Chi.

November 9, 2012

Twitter Flies by Hadoop on Search Quest

Filed under: Hadoop,HDFS,Tweets — Patrick Durusau @ 4:38 pm

Twitter Flies by Hadoop on Search Quest by Ian Armas Foster.

From the post:

People who use Twitter may not give a second thought to the search bar at the top of the page. It’s pretty basic, right? You type something into the nifty little box and, like the marvel of efficient search that it is, it offers suggestions for things the user might wish to search during the typing process.

On the surface, it operates like any decent search engine. Except, of course, this is Twitter we’re talking about. There is no basic functionality at the core here. As it turns out, a significant amount of effort went into designing the Twitter search suggestion engine and the network is still just getting started refining this engine.

A recent Twitter-published scientific paper tells the tale of Twitter’s journey through their previously existing Hadoop infrastructure to a custom combined infrastructure. This connects the HDFS to a frontend cache (to deal with queries and responses) and a backend (which houses algorithms that rank relevance).

The latency of the Hadoop solution was too high.

Makes me think about topic map authoring with a real-time “merging” interface, one that displays the results of the current topic, association or occurrence being authored on the map.

Or at least the option to choose to see such a display with some reasonable response time.

November 4, 2012

Intro to Scalding by @posco and @argyris [video lecture]

Filed under: BigData,Scala,Scalding,Tweets — Patrick Durusau @ 3:40 pm

Intro to Scalding by @posco and @argyris by Marti Hearst.

From the post:

On Thursday we learned about an alternative language for analyzing big data: Scalding. It’s built on Scala and is used extensively by the Twitter Revenue group. Oscar Boykin presented a lecture that he and Argyris Zymnis put together for us:

(video – see Marti’s post)

Because Scalding is built on the functional programming language Scala, it has an advantage over Pig in that you can have the equivalent of user-defined functions directly in your code. See the lecture notes for more details. Be sure to watch the video to get all the details, especially since Oscar managed to make us all laugh throughout his lecture. Thanks guys!

Another great lecture from Marti’s class, “Analyzing Big Data with Twitter.”

When the revenue department of a business, at least a successful business, starts using a technology, it’s time to take notice.

Saving Tweets

Filed under: Javascript,SQLite,Tweets — Patrick Durusau @ 3:24 pm

No, it’s not another social cause to save X, but rather Pierre Lindenbaum saving his own tweets in: Saving your tweets in a database using sqlite, rhino, scribe, javascript.

Requires SQLite, Mozilla Rhino, Scribe and Apache Commons Codec.

Mapping the saved tweets comes to mind. I am sure you can imagine other uses in a network of topic maps.
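A minimal Python equivalent of the idea, if you would rather skip the Rhino/Scribe stack. The table layout is my own choice; id_str, created_at and text are standard Twitter API fields:

```python
import json
import sqlite3

conn = sqlite3.connect("tweets.db")
conn.execute("""CREATE TABLE IF NOT EXISTS tweets (
                  id TEXT PRIMARY KEY,
                  created_at TEXT,
                  text TEXT)""")

def save(tweet_json):
    """Insert one tweet (raw JSON string from the API) into the database."""
    t = json.loads(tweet_json)
    conn.execute("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?)",
                 (t["id_str"], t["created_at"], t["text"]))
    conn.commit()
```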
