Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 31, 2012

Incremental face recognition for large-scale social network services

Filed under: Image Recognition,Social Networks — Patrick Durusau @ 4:10 pm

Incremental face recognition for large-scale social network services by Kwontaeg Choi, Kar-Ann Toh, and Hyeran Byun.

Abstract:

Due to the rapid growth of social network services such as Facebook and Twitter, incorporation of face recognition in these large-scale web services is attracting much attention in both academia and industry. The major problem in such applications is to deal efficiently with the growing number of samples as well as local appearance variations caused by diverse environments for the millions of users over time. In this paper, we focus on developing an incremental face recognition method for Twitter application. Particularly, a data-independent feature extraction method is proposed via binarization of a Gabor filter. Subsequently, the dimension of our Gabor representation is reduced considering various orientations at different grid positions. Finally, an incremental neural network is applied to learn the reduced Gabor features. We apply our method to a novel application which notifies new photograph uploading to related users without having their ID being identified. Our extensive experiments show that the proposed algorithm significantly outperforms several incremental face recognition methods with a dramatic reduction in computational speed. This shows the suitability of the proposed method for a large-scale web service with millions of users.

Any number of topic map uses suggest themselves for robust face recognition software.

What’s yours?
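I won’t pretend to reproduce the authors’ pipeline, but if “binarization of a Gabor filter” sounds opaque, here is a minimal numpy/scipy sketch of the general idea (the kernel parameters are mine and purely illustrative):

    import numpy as np
    from scipy.signal import convolve2d

    def gabor_kernel(size=15, sigma=3.0, theta=0.0, lam=6.0):
        # Real part of a Gabor filter: Gaussian envelope times a cosine wave.
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        xr = x * np.cos(theta) + y * np.sin(theta)    # rotated coordinates
        yr = -x * np.sin(theta) + y * np.cos(theta)
        envelope = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2))
        return envelope * np.cos(2 * np.pi * xr / lam)

    def binarized_gabor_features(image, orientations=4):
        # Convolve at several orientations and keep only the sign of the
        # response: a compact, data-independent binary representation.
        feats = []
        for k in range(orientations):
            kernel = gabor_kernel(theta=k * np.pi / orientations)
            response = convolve2d(image, kernel, mode="same")
            feats.append((response > 0).astype(np.uint8))
        return np.stack(feats)

    patch = np.random.rand(32, 32)                    # stand-in for a face crop
    print(binarized_gabor_features(patch).shape)      # (4, 32, 32)

Since the features are data-independent, new users do not force recomputation over everyone else, which is what makes the incremental setting workable.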

March 30, 2012

Structural Analysis of Large Networks: Observations and Applications

Filed under: Graphs,Networks,Social Networks — Patrick Durusau @ 4:34 pm

Structural Analysis of Large Networks: Observations and Applications by Mary McGlohon.

Abstract:

Network data (also referred to as relational data, social network data, real graph data) has become ubiquitous, and understanding patterns in this data has become an important research problem. We investigate how interactions in social networks are formed and how these interactions facilitate diffusion, model these behaviors, and apply these findings to real-world problems.

We examined graphs of size up to 16 million nodes, across many domains from academic citation networks, to campaign contributions and actor-movie networks. We also performed several case studies in online social networks such as blogs and message board communities.

Our major contributions are the following: (a) We discover several surprising patterns in network topology and interactions, such as the Popularity Decay power law (in-links to a blog post decay with a power law with −1.5 exponent) and the oscillating size of connected components; (b) We propose generators such as the Butterfly generator that reproduce both established and new properties found in real networks; (c) several case studies, including a proposed method of detecting misstatements in accounting data, where using network effects gave a significant boost in detection accuracy.

A dissertation that establishes it isn’t the size of the network (think “web scale”) but the skill with which it is analyzed that is important.

McGlohon investigates the discovery of outliers, fraud and the like.

Worth reading and then formulating questions for your graph/graph database vendor about their support for such features.

March 4, 2012

Social networks in the database: using a graph database

Filed under: Graph Databases,Neo4j,Social Networks — Patrick Durusau @ 7:17 pm

Social networks in the database: using a graph database

The Neo4j response to Lorenzo Alberton’s post on social networks in a relational database.

From the post:

Recently Lorenzo Alberton gave a talk on Trees In The Database where he showed the most used approaches to storing trees in a relational database. Now he has moved on to an even more interesting topic with his article Graphs in the database: SQL meets social networks. Right from the beginning of his excellent article Alberton puts this technical challenge in a proper context:

Graphs are ubiquitous. Social or P2P networks, thesauri, route planning systems, recommendation systems, collaborative filtering, even the World Wide Web itself is ultimately a graph! Given their importance, it’s surely worth spending some time in studying some algorithms and models to represent and work with them effectively.

After a brief explanation of what a graph data structure is, the article goes on to show how graphs can be represented in a table-based database. The rest of the article shows in detail how an adjacency list model can be used to represent a graph in a relational database. Different examples are used to illustrate what can be done in this way.
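The adjacency list model is small enough to show in a few lines. A sketch using Python’s built-in sqlite3 (my toy schema, not Alberton’s exact tables):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE edges (src INTEGER REFERENCES nodes(id),
                            dst INTEGER REFERENCES nodes(id),
                            PRIMARY KEY (src, dst));
    """)
    db.executemany("INSERT INTO nodes VALUES (?, ?)",
                   [(1, "alice"), (2, "bob"), (3, "carol")])
    db.executemany("INSERT INTO edges VALUES (?, ?)", [(1, 2), (2, 3)])

    # Direct neighbours of alice: a single join, cheap in any RDBMS.
    for (name,) in db.execute("""SELECT n.name FROM edges e
                                 JOIN nodes n ON n.id = e.dst
                                 WHERE e.src = 1"""):
        print(name)                       # -> bob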

Graph databases and Neo4j in particular offer advantages when used with graphs but the Neo4j post overlooks several points.

Unlike graph databases, SQL databases are nearly, if not actually, ubiquitous. A user’s first “taste” of graph processing may well come via a SQL database and lead them to expect more graph capabilities than a SQL solution can offer.

As Lorenzo points out in his posting, performance will vary depending upon the graph operations you need to perform. True for SQL databases and graph databases as well. Having a graph database doesn’t mean all graph algorithms run efficiently on your data set.

Finally:

A table-based system makes a good fit for static and simple data structures, ….

Isn’t going to ring true for anyone familiar with Oracle, PostgreSQL, MySQL, SQL Server, Informix, DB2 or any number of other “table-based systems.”

Graphs in the database: SQL meets social networks

Filed under: Database,Graphs,Social Networks,SQL — Patrick Durusau @ 7:17 pm

Graphs in the database: SQL meets social networks by Lorenzo Alberton.

If you are interested in graphs, SQL databases, Common Table Expressions (CTEs), together or in any combination, this is the article for you!

Lorenzo walks the reader through the basics of graphs with an emphasis on understanding how SQL techniques can be successfully used, depending upon your requirements.

From the post:

Graphs are ubiquitous. Social or P2P networks, thesauri, route planning systems, recommendation systems, collaborative filtering, even the World Wide Web itself is ultimately a graph! Given their importance, it’s surely worth spending some time in studying some algorithms and models to represent and work with them effectively. In this short article, we’re going to see how we can store a graph in a DBMS. Given how much attention my talk about storing a tree data structure in the db received, it’s probably going to be interesting to many. Unfortunately, the Tree models/techniques do not apply to generic graphs, so let’s discover how we can deal with them.
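Since Lorenzo leans on Common Table Expressions, here is the shape of a recursive CTE doing graph reachability, in a self-contained Python/sqlite3 sketch (toy data):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE edges (src INTEGER, dst INTEGER);
        INSERT INTO edges VALUES (1, 2), (2, 3), (3, 4);
    """)

    # All nodes reachable from node 1 (SQLite supports WITH RECURSIVE
    # from version 3.8.3 on).
    rows = db.execute("""
        WITH RECURSIVE reach(id) AS (
            SELECT 1
            UNION
            SELECT e.dst FROM edges e JOIN reach r ON e.src = r.id
        )
        SELECT id FROM reach
    """).fetchall()
    print(rows)                           # nodes 1 through 4

The UNION (rather than UNION ALL) deduplicates as it goes, which is what keeps the recursion from looping on a cyclic graph.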

January 22, 2012

The Role of Social Networks in Information Diffusion

Filed under: Networks,Social Graphs,Social Media,Social Networks — Patrick Durusau @ 7:35 pm

The Role of Social Networks in Information Diffusion by Eytan Bakshy, Itamar Rosenn, Cameron Marlow and Lada Adamic.

Abstract:

Online social networking technologies enable individuals to simultaneously share information with any number of peers. Quantifying the causal effect of these technologies on the dissemination of information requires not only identification of who influences whom, but also of whether individuals would still propagate information in the absence of social signals about that information. We examine the role of social networks in online information diffusion with a large-scale field experiment that randomizes exposure to signals about friends’ information sharing among 253 million subjects in situ. Those who are exposed are significantly more likely to spread information, and do so sooner than those who are not exposed. We further examine the relative role of strong and weak ties in information propagation. We show that, although stronger ties are individually more influential, it is the more abundant weak ties who are responsible for the propagation of novel information. This suggests that weak ties may play a more dominant role in the dissemination of information online than currently believed.

Sample size: 253 million Facebook users.

Pay attention to the line:

We show that, although stronger ties are individually more influential, it is the more abundant weak ties who are responsible for the propagation of novel information.

If you have a “Web scale” (whatever that means) information delivery issue, you should not only target CNN and Drudge with press releases but also consider targeting actors with abundant weak ties.
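If you wanted to experiment with that targeting idea, a toy NetworkX sketch for ranking actors by their count of weak ties might look like this (the weight cutoff is arbitrary, and “interaction count” as tie strength is my stand-in for the paper’s measures):

    import networkx as nx

    # Toy friendship graph; weight = interaction count as a proxy
    # for tie strength.
    G = nx.Graph()
    G.add_weighted_edges_from([
        ("ann", "bob", 42), ("ann", "cat", 1), ("ann", "dan", 2),
        ("bob", "cat", 37), ("cat", "eve", 1), ("dan", "eve", 1),
    ])

    WEAK = 3    # below this many interactions, call the tie weak (arbitrary)

    def weak_tie_count(g, node):
        return sum(1 for _, _, w in g.edges(node, data="weight") if w < WEAK)

    # Candidate seeds for spreading novel information: most weak ties first.
    print(sorted(G.nodes, key=lambda n: weak_tie_count(G, n), reverse=True))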

Thinking this could be important in topic map driven applications that “push” novel information into the social network of a large, distributed company. You know how few of us actually read the tiresome broadcast stuff from HR, etc., so what if the important parts were “reported” piecemeal by others?

It is great to have a large functioning topic map but it doesn’t become useful until people make the information it delivers their own and take action based upon it.

January 11, 2012

Social Networks and Archival Context Project (SNAC)

Filed under: Archives,Networks,Social Graphs,Social Networks — Patrick Durusau @ 8:03 pm

Social Networks and Archival Context Project (SNAC)

From the homepage:

The Social Networks and Archival Context Project (SNAC) will address the ongoing challenge of transforming description of and improving access to primary humanities resources through the use of advanced technologies. The project will test the feasibility of using existing archival descriptions in new ways, in order to enhance access and understanding of cultural resources in archives, libraries, and museums.

Archivists have a long history of describing the people who—acting individually, in families, or in formally organized groups—create and collect primary sources. They research and describe the people who create and are represented in the materials comprising our shared cultural legacy. However, because archivists have traditionally described records and their creators together, this information is tied to specific resources and institutions. Currently there is no system in place that aggregates and interrelates those descriptions.

Leveraging the new standard Encoded Archival Context-Corporate Bodies, Persons, and Families (EAC-CPF), the SNAC Project will use digital technology to “unlock” descriptions of people from finding aids and link them together in exciting new ways.

On the Prototype page you will find the following description:

While many of the names found in finding aids have been carefully constructed, frequently in consultation with LCNAF, many other names present extraction and matching challenges. For example, many personal names are in direct rather than indirect (or catalog entry) order. Life dates, if present, sometimes appear in parentheses or brackets. Numerous names sometimes appear in the same <persname>, <corpname>, or <famname>. Many names are incorrectly tagged, for example, a personal name tagged as a .

We will continue to refine the extraction and matching algorithms over the course of the project, but it is anticipated that it will only be possible to address some problems through manual editing, perhaps using “professional crowd sourcing.”
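To see why the extraction is hard, consider a rough normalizer for the name forms the prototype page mentions (entirely hypothetical code, not SNAC’s algorithm):

    import re

    # Life dates like "1890-1974", optionally in parentheses or brackets.
    DATES = re.compile(r"[\[(]?\b(\d{4})\s*-\s*(\d{4})?\b[\])]?")

    def normalize(raw):
        m = DATES.search(raw)
        name = DATES.sub("", raw).strip(" ,")
        if "," in name:                   # indirect (catalog) order: surname first
            surname, forename = [p.strip() for p in name.split(",", 1)]
        else:                             # direct order
            forename, _, surname = name.rpartition(" ")
        return surname, forename, (m.groups() if m else (None, None))

    print(normalize("Bush, Vannevar, 1890-1974"))
    print(normalize("Vannevar Bush (1890-1974)"))
    # both -> ('Bush', 'Vannevar', ('1890', '1974'))

Real finding aids are messier than either example, which is why the project expects manual editing to remain part of the workflow.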

While the project is still a prototype, it occurs to me that it would make a handy source of identifiers.

Try the SNAC record for Vannevar Bush (http://socialarchive.iath.virginia.edu/xtf/view?docId=bush-vannevar-1890-1974-cr.xml), or one of the many others you will find at: Find Corporate, Personal, and Family Archival Context Records.

OK, now I have a question for you: all of the foregoing also appear in Wikipedia.

For your comparison: if you could choose only one identifier for a subject, would you choose the SNAC or the Wikipedia links?

I ask because some semantic approaches take a “one ring” approach to identification, ignoring the existence of multiple identifiers, even URL identifiers, for the same subjects.

Of course, you already know that with topic maps you can have multiple identifiers for any subject.

In CTM syntax:

bush-vannevar
    http://socialarchive.iath.virginia.edu/xtf/view?docId=bush-vannevar-1890-1974-cr.xml ;
    http://en.wikipedia.org/wiki/Vannevar_Bush ;
    - "Vannevar Bush" ;
    - varname: "Bush, Vannevar, 1890-1974" ;
    - varname: "Bush, Vannevar, 1890-" .

Which of course means that if I want to make a statement about the webpage for Vannevar Bush at Wikipedia, I can do so without any confusion:

wikipedia-vannevar-bush
    = http://en.wikipedia.org/wiki/Vannevar_Bush ;
    descr: "URL as subject locator." .

Or I can comment on a page at SNAC and map additional information to it. And you will always know if I am using the URL as an identifier or to point you towards a subject.

January 4, 2012

Algorithm estimates who’s in control

Filed under: Data Analysis,Discourse,Linguistics,Social Graphs,Social Networks — Patrick Durusau @ 10:43 am

Algorithm estimates who’s in control

Jon Kleinberg, whose work influenced Google’s PageRank, is working on ranking something else. Kleinberg et al. developed an algorithm that ranks people based on how they speak to each other.

This, on the heels of Big Brother’s Name is… (below), has to have you wondering if you even want Internet access at all. 😉

Just imagine, power (who has, who doesn’t) analysis of email discussion lists, wiki edits, email archives, transcripts.
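Kleinberg et al. rank people from linguistic cues in the conversation itself. A much cruder structural stand-in you could try today is PageRank over a who-replies-to-whom graph (my substitution, not their method):

    import networkx as nx

    # Edges point from the replier to the person replied to; being replied
    # to often (by people who are themselves replied to) is a crude status
    # signal.
    replies = [("intern", "manager"), ("intern", "lead"),
               ("manager", "lead"), ("lead", "vp"), ("manager", "vp")]
    G = nx.DiGraph(replies)

    for person, score in sorted(nx.pagerank(G).items(), key=lambda kv: -kv[1]):
        print(f"{person:8s} {score:.3f}")   # vp and lead rank highest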

This has the potential (along with other clever analysis) to identify and populate topic maps with some very interesting subjects.

I first saw this at FlowingData.

Big Brother’s Name is…

Filed under: Marketing,Networks,Social Media,Social Networks — Patrick Durusau @ 7:09 am

not the FBI, CIA, Interpol, Mossad, NSA or any other government agency.

Walmart all but claims that name at: Social Genome.

From the webpage:

In a sense, the social world — all the millions and billions of tweets, Facebook messages, blog postings, YouTube videos, and more – is a living organism itself, constantly pulsating and evolving. The Social Genome is the genome of this organism, distilling it to the most essential aspects.

At the labs, we have spent the past few years building and maintaining the Social Genome itself. We do this using public data on the Web, proprietary data, and a lot of social media. From such data we identify interesting entities and relationships, extract them, augment them with as much information as we can find, then add them to the Social Genome.

For example, when Susan Boyle was first mentioned on the Web, we quickly detected that she was becoming an interesting person in the world of social media. So we added her to the Social Genome, then monitored social media to collect more information about her. Her appearances became events, and the bigger events were added to the Social Genome as well. As another example, when a new coffee maker was mentioned on the Web, we detected and added it to the Social Genome. We strive to keep the Social Genome up to date. For example, we typically detect and add information from a tweet into the Social Genome within two seconds, from the moment the tweet arrives in our labs.

As a result of our effort, the Social Genome is a vast, constantly changing, up-to-date knowledge base, with hundreds of millions of entities and relationships. We then use the Social Genome to perform semantic analysis of social media, and to power a broad array of e-commerce applications. For example, if a user never uses the word “coffee”, but has mentioned many gourmet coffee brands (such as “Kopi Luwak”) in his tweets, we can use the Social Genome to detect the brands, and infer that he is interested in gourmet coffee. As another example, using the Social Genome, we may find that a user frequently mentions movies in her tweets. As a result, when she tweeted “I love salt!”, we can infer that she is probably talking about the movie “salt”, not the condiment (both of which appear as entities in the Social Genome).

Two seconds after you hit “send” on your tweet, it has been stripped, analyzed and added to the Social Genome at WalMart. For every tweet. Plus other data.
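The “salt” example boils down to entity disambiguation against a knowledge base. In miniature (entirely made-up data and scoring, not Walmart’s Social Genome):

    # Hypothetical knowledge base: context words per entity.
    KB = {
        "salt (movie)":     {"movie", "film", "jolie", "watch", "trailer"},
        "salt (condiment)": {"recipe", "pepper", "cook", "taste", "food"},
    }

    def disambiguate(mention, user_history):
        history = {w.lower() for w in user_history}
        candidates = [e for e in KB if e.startswith(mention)]
        # Pick the candidate whose context words overlap the user's tweets most.
        return max(candidates, key=lambda e: len(KB[e] & history))

    print(disambiguate("salt", ["loved", "the", "movie", "trailer"]))
    # -> salt (movie)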

How should we respond to this news?

One response is to trust that WalMart, and whoever it sells this data trove to, will use the information to enhance your shopping experience and help you achieve greater fulfilment by balancing shopping against your credit limit.

Another response is to ask for legislation to attempt regulation of a multi-national corporation that is larger than many governments.

Another response is to hold sit-ins and social consciousness raising events at WalMart locations.

My suggestion? One good turn deserves another.

WalMart is owned by someone. WalMart has a board of directors. WalMart has corporate officers. WalMart has managers, sales representatives, attorneys and advertising executives. All of whom have information footprints. Perhaps not as public as ours, but they exist. Why not gather up information on who is running WalMart? Fighting fire with fire, as they say. Publish that information so that regulators, stock brokers, divorce lawyers and others can have access to it.

Let’s welcome WalMart as “Little Big Brothers.”

December 25, 2011

Arnetminer

Filed under: Networks,Social Networks — Patrick Durusau @ 6:07 pm

Arnetminer: search and mining of academic social networks

From the webpage:

Arnetminer (arnetminer.org) aims to provide comprehensive search and mining services for researcher social networks. In this system, we focus on: (1) creating a semantic-based profile for each researcher by extracting information from the distributed Web; (2) integrating academic data (e.g., the bibliographic data and the researcher profiles) from multiple sources; (3) accurately searching the heterogeneous network; (4) analyzing and discovering interesting patterns from the built researcher social network. The main search and analysis functions in arnetminer include:

  • Profile search: input a researcher name (e.g., Jie Tang), the system will return the semantic-based profile created for the researcher using information extraction techniques. In the profile page, the extracted and integrated information include: contact information, photo, citation statistics, academic achievement evaluation, (temporal) research interest, educational history, personal social graph, research funding (currently only US and CN), and publication records (including citation information, and the papers are automatically assigned to several different domains).
  • Expert finding: input a query (e.g., data mining), the system will return experts on this topic. In addition, the system will suggest the top conference and the top ranked papers on this topic. There are two ranking algorithms, VSM and ACT. The former is similar to the conventional language model and the latter is based on our Author-Conference-Topic (ACT) model. Users can also provide feedbacks to the search results.
  • Conference analysis: input a conference name (e.g., KDD), the system returns who are the most active researchers on this conference, and the top-ranked papers.
  • Course search: input a query (e.g., data mining), the system will tell you who are teaching courses relevant to the query.
  • Associate search: input two researcher names, the system returns the association path between the two researchers. The function is based on the well-known "six-degree separation" theory.
  • Sub-graph search: input a query (e.g., data mining), the system first tells you what topics are relevant to the query (e.g., five topics "Data mining", "XML Data", "Data Mining / Query Processing", "Web Data / Database design", "Web Mining" are relevant), and then display the most important sub-graph discovered on each relevant topic, augmented with a summary for the sub-graph.
  • Topic browser: based on our Author-Conference-Topic (ACT) model, we automatically discover 200 hot topics from the publications. For each topic, we automatically assign a label to represent its meanings. Furthermore, the browser presents the most active researchers, the most relevant conferences/papers, and the evolution trend of the topic is discovered.
  • Academic ranks: we define 8 measures to evaluate the researcher's achievement. The measures include "H-index", "Citation", "Uptrend", "Activity", "Longevity", "Diversity", "Sociability", "New Star". For each measure, we output a ranking list in different domains. For example, one can search who has the highest citation number in the "data mining" domain.
  • User management: one can register as a user to: (1) modify the extracted profile information; (2) provide feedback on the search results; (3) follow researchers in arnetminer; (4) create an arnetminer page (which can be used to advertise confs/workshops, or recruit students).

Arnetminer.org has been in operation on the internet for more than three years. Currently, the academic network includes more than 6,000 conferences, 3,200,000 publications, 700,000 researcher profiles. The system attracts users from more than 200 countries and receives >200,000 access logs per day. The top five countries where users come from are United States, China, Germany, India, and United Kingdom.

A rich data source and a way to explore who’s who in particular domains.
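The associate search feature, for example, reduces to a shortest path query over the co-authorship network. In NetworkX (made-up names and edges):

    import networkx as nx

    # Made-up co-authorship edges.
    G = nx.Graph([("Tang", "Li"), ("Li", "Wang"), ("Wang", "Smith"),
                  ("Tang", "Zhao"), ("Zhao", "Smith")])
    print(nx.shortest_path(G, "Tang", "Smith"))   # ['Tang', 'Zhao', 'Smith']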

December 21, 2011

Thoughts on ICDM (the IEEE conference on Data Mining)

Filed under: Data Mining,Graphs,Social Networks — Patrick Durusau @ 7:24 pm

Thoughts on ICDM I: Negative Results (part A) by Suresh Venkatasubramanian.

From (part A):

I just got back from ICDM (the IEEE conference on Data Mining). Data mining conferences are quite different from theory conferences (and much more similar to ML or DB conferences): there are numerous satellite events (workshops, tutorials and panels in this case), many more people (551 for ICDM, and that’s on the smaller side), and a wide variety of papers that range from SODA-ish results to user studies and industrial case studies.

While your typical data mining paper is still a string of techniques cobbled together without rhyme or reason (anyone for spectral manifold-based correlation clustering with outliers using MapReduce?), there are some general themes that might be of interest to an outside viewer. What I’d like to highlight here is a trend (that I hope grows) in negative results.

It’s not particularly hard to invent a new method for doing data mining. It’s much harder to show why certain methods will fail, or why certain models don’t make sense. But in my view, the latter is exactly what the field needs in order to give it a strong inferential foundation to build on (I’ll note here that I’m talking specifically about data mining, NOT machine learning – the difference between the two is left for another post).

From (part B):

Continuing where I left off on the idea of negative results in data mining, there was a beautiful paper at ICDM 2011 on the use of Stochastic Kronecker graphs to model social networks. And in this case, the key result of the paper came from theory, so stay tuned !

One of the problems that bedevils research in social networking is the lack of good graph models. Ideally, one would like a random graph model that evolves into structures that look like social networks. Having such a graph model is nice because

  • you can target your algorithms to graphs that look like this, hopefully making them more efficient
  • You can re-express an actual social network as a set of parameters to a graph model: it compacts the graph, and also gives you a better way of understanding different kinds of social networks: Twitter is a (0.8, 1, 2.5) and Facebook is a (1, 0.1, 0.5), and so on.
  • If you’re lucky, the model describes not just reality, but how it forms. In other words, the model captures the actual social processes that lead to the formation of a social network. This last one is of great interest to sociologists.

But there aren’t even good graph models that capture known properties of social networks. For example, the classic Erdos-Renyi (ER) model of a random graph doesn’t have the heavy-tailed degree distribution that’s common in social networks. It also doesn’t have a property that’s common to large social networks: densification, or the fact that even as the network grows, the diameter stays small (implying that the network seems to get denser over time).

Part C – forthcoming –

I am perhaps more sceptical of modeling than the author but this is a very readable and interesting set of blog posts. I will be posting Part C as soon as it appears.
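If you want to see the gap Suresh describes for yourself, compare the degree sequence of an Erdos-Renyi graph with a preferential attachment (Barabasi-Albert) graph, which does produce heavy tails. A quick NetworkX sketch (sizes arbitrary):

    import networkx as nx

    n, m = 10_000, 3
    er = nx.gnm_random_graph(n, n * m)     # Erdos-Renyi, same edge budget
    ba = nx.barabasi_albert_graph(n, m)    # preferential attachment

    def top_degrees(g, k=5):
        return sorted((d for _, d in g.degree()), reverse=True)[:k]

    # ER degrees cluster tightly around the mean; BA produces heavy-tailed hubs.
    print("ER:", top_degrees(er))
    print("BA:", top_degrees(ba))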

Update: Thoughts on ICDM I: Negative results (part C)

From Part C:

If you come up with a better way of doing classification (for now let’s just consider classification, but these remarks apply to clustering and other tasks as well), you have to compare it to prior methods to see which works better. (note: this is a tricky problem in clustering that my student Parasaran Raman has been working on: more on that later.).

The obvious way to compare two classification methods is how well they do compared to some ground truth (i.e. labelled data), but this is a one-parameter system, because by changing the threshold of the classifier (or if you like, translating the hyperplane around), you can change the false positive and false negative rates.

Now the more smug folks reading this are waiting with ‘ROC’ and “AUC” at the tip of their tongues, and they’d be right! You can plot a curve of the false positive vs false negative rate and take the area under the curve (AUC) as a measure of the effectiveness of the classifier.

For example, if the y-axis measured increasing false negatives, and the x-axis measured increasing false positives, you’d want a curve that looked like an L with the apex at the origin, and a random classifier would look like the line x+y = 1. The AUC score would be zero for the good classifier and 0.5 for the bad one (there are ways of scaling this to be between 0 and 1).

The AUC is a popular way of comparing methods in order to balance the different error rates. It’s also attractive because it’s parameter-free and is objective: seemingly providing a neutral method for comparing classifiers independent of data sets, cost measures and so on.

But is it?
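As a refresher while we wait for the answer: AUC is equivalent to the probability that a random positive outranks a random negative, which makes it easy to compute directly (a minimal sketch, toy scores):

    def auc(scores, labels):
        # Probability that a random positive outranks a random negative,
        # counting ties as half; equals the area under the ROC curve.
        pos = [s for s, y in zip(scores, labels) if y == 1]
        neg = [s for s, y in zip(scores, labels) if y == 0]
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))

    print(auc([0.9, 0.8, 0.7, 0.4, 0.3], [1, 1, 0, 1, 0]))   # 0.8333...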

December 14, 2011

Prosper Loan Data Part II of II – Social Network Analysis: What is the Value of a Friend?

Filed under: Data Mining,Social Networks — Patrick Durusau @ 7:45 pm

Prosper Loan Data Part II of II – Social Network Analysis: What is the Value of a Friend?

From the post:

Since Prosper provides data on members and their friends who are also members, we can conduct a simple “social network” analysis. What is the value of a friend when getting approved for a loan through Prosper? I first determined how many borrowers were approved and how many borrowers were declined for a loan. Next, I determined how many approved friends each borrower had.

Moral of this story: Pick better friends. 😉

Question: Has anyone done the same sort of analysis on arrest/conviction records? Include known children in the social network as well.

What other information would you want to bind into the social network?
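The core computation in the post is a one-liner once the data is in a table: approval rate grouped by number of approved friends. A pandas sketch over hypothetical rows:

    import pandas as pd

    # Hypothetical rows in the spirit of the Prosper data.
    df = pd.DataFrame({
        "approved":         [1, 0, 1, 1, 0, 0, 1, 0],
        "approved_friends": [2, 0, 1, 3, 0, 1, 0, 0],
    })

    # Approval rate by number of approved friends: the post's computation.
    print(df.groupby("approved_friends")["approved"].mean())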

December 11, 2011

Klout Search Powered by ElasticSearch, Scala, Play Framework and Akka

Filed under: Social Media,Social Networks — Patrick Durusau @ 9:24 pm

Klout Search Powered by ElasticSearch, Scala, Play Framework and Akka

From the post:

At Klout, we love data and as Dave Mariani, Klout’s VP of Engineering, stated in his latest blog post, we’ve got lots of it! Klout currently uses Hadoop to crunch large volumes of data but what do we do with that data? You already know about the Klout score, but I want to talk about a new feature I’m extremely excited about — search!

Problem at Hand

I just want to start off by saying, search is hard! Yet, the requirements were pretty simple: we needed to create a robust solution that would allow us to search across all scored Klout users. Did I mention it had to be fast? Everyone likes to go fast! The problem is that 100 Million People have Klout (and that was this past September—an eternity in Social Media time) which means our search solution had to scale, scale horizontally.

Well, more of a “testimonial” as the Wizard of Oz would say but the numbers are serious enough to merit further investigation.

Although I must admit that social networking sites are spreading faster than, well, faster than some social contagions.

Unless someone is joining each of them multiple times for spamming purposes, I suspect some consolidation is in the not too distant future. What happens to all the links, etc., at the services that go away?

Just curious.

November 22, 2011

Social Network Analysis — Finding communities and influencers

Filed under: Social Networks — Patrick Durusau @ 6:58 pm

Social Network Analysis — Finding communities and influencers

Webcast: Date: Tuesday, December 6, 2011
Time: 10 PT, San Francisco

Presented by: Maksim Tsvetovat
Duration: Approximately 60 minutes.
Cost: Free

Description:

A follow-on to Analyzing Social Networks on Twitter, this webcast will concentrate on the social component of Twitter data rather than the questions of data gathering and decomposition. Using a predefined dataset, we will attempt to find communities of people on Twitter that express particular interests. We will also mine Twitter streams for cascades of information diffusion, and determine the most influential individuals in these cascades. The webcast will contain an initial introduction to Social Network Analysis methods and metrics.

About Maksim Tsvetovat

Maksim Tsvetovat is an interdisciplinary scientist, a software engineer, and a jazz musician. He has received his doctorate from Carnegie Mellon University in the field of Computation, Organizations and Society, concentrating on computational modeling of evolution of social networks, diffusion of information and attitudes, and emergence of collective intelligence. Currently, he teaches social network analysis at George Mason University. He is also a co-founder of DeepMile Networks, a startup company concentrating on mapping influence in social media. Maksim also teaches executive seminars in social network analysis, including “Social Networks for Startups” and “Understanding Social Media for Decisionmakers”.

Matt O’Donnell tweeted about this webcast and Social Network Analysis for Startups. (@mdbod)

November 20, 2011

Graphity: An efficient Graph Model for Retrieving the Top-k News Feeds for users in social networks

Filed under: Graphity,Graphs,Neo4j,Networks,Social Media,Social Networks — Patrick Durusau @ 4:11 pm

Graphity: An efficient Graph Model for Retrieving the Top-k News Feeds for users in social networks by Rene Pickhardt.

From the post:

I already said that my first research results have been submitted to SIGMOD conference to the social networks and graph databases track. Time to sum up the results and blog about them.

I created a data model to make retrieval of social news feeds in social networks very efficient. It is able to dynamically retrieve more than 10’000 temporal ordered news feeds per second in social networks with millions of users like Facebook and Twitter by using graph data bases (like neo4j)

10,000 temporally ordered news feeds per second? I can imagine any number of use cases that fit comfortably within those performance numbers!

How about you?

Looking forward to the paper (and source code)!
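While we wait, the flavor of the retrieval problem is easy to state: lazily merge each friend’s newest-first status list and stop after k items. A toy Python sketch (Graphity’s contribution is doing this efficiently inside the graph store, which this sketch does not attempt):

    import heapq
    from itertools import islice

    # Each friend's updates, newest first (toy timestamps).
    feeds = {
        "alice": [(1700003, "alice: hi"), (1690000, "alice: old news")],
        "bob":   [(1700002, "bob: lunch?")],
        "carol": [(1700001, "carol: photo"), (1695000, "carol: earlier")],
    }

    def top_k_feed(feeds, k):
        # Lazily merge several sorted streams; stop after k items.
        merged = heapq.merge(*feeds.values(), key=lambda t: t[0], reverse=True)
        return [post for _, post in islice(merged, k)]

    print(top_k_feed(feeds, 3))
    # ['alice: hi', 'bob: lunch?', 'carol: photo']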

October 21, 2011

ForceAtlas2

Filed under: Gephi,Graphs,Networks,Social Graphs,Social Networks — Patrick Durusau @ 7:27 pm

ForceAtlas2 (paper) +appendices by Mathieu Jacomy, Sebastien Heymann, Tommaso Venturini, and Mathieu Bastian.

Abstract:

ForceAtlas2 is a force vector algorithm proposed in the Gephi software, appreciated for its simplicity and for the readability of the networks it helps to visualize. This paper presents its distinctive features, its energy-model and the way it optimizes the “speed versus precision” approximation to allow quick convergence. We also claim that ForceAtlas2 is handy because the force vector principle is unaffected by optimizations, offering a smooth and accurate experience to users.

I knew I had to cite this paper when I read:

These earliest Gephi users were not fully satisfied with existing spatialization tools. We worked on empirical improvements and that’s how we created the first version of our own algorithm, ForceAtlas. Its particularity was a degree-dependant repulsion force that causes less visual cluttering. Since then we steadily added some features while trying to keep in touch with users’ needs. ForceAtlas2 is the result of this long process: a simple and straightforward algorithm, made to be useful for experts and profanes. (footnotes omitted, emphasis added)

Profanes. I like that! Well, rather I like the literacy that enables a writer to use that in a technical paper.

Highly recommended paper.
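If you want to feel how the degree-dependent repulsion works, here is one iteration of a toy force-directed layout in numpy (my simplification of the idea, not the Gephi implementation; the constants are arbitrary):

    import numpy as np

    def layout_step(pos, edges, deg, step=0.01):
        # One toy iteration: degree-dependent repulsion between every pair,
        # (deg(u)+1)*(deg(v)+1)/distance, plus linear attraction along edges.
        force = np.zeros_like(pos)
        n = len(pos)
        for u in range(n):
            for v in range(u + 1, n):
                d = pos[u] - pos[v]
                dist = np.linalg.norm(d) + 1e-9
                f = (deg[u] + 1) * (deg[v] + 1) * d / dist ** 2
                force[u] += f
                force[v] -= f
        for u, v in edges:
            d = pos[v] - pos[u]
            force[u] += d
            force[v] -= d
        return pos + step * force

    edges = [(0, 1), (1, 2), (1, 3)]
    deg = np.bincount(np.array(edges).ravel(), minlength=4)
    pos = np.random.rand(4, 2)
    for _ in range(50):
        pos = layout_step(pos, edges, deg)
    print(pos)    # node 1 (the hub) gets pushed clear of the leaves

Weighting repulsion by degree is what reduces the visual clutter around hubs: well-connected nodes shove their surroundings harder, so their neighborhoods spread out.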

September 28, 2011

Thoora is Your Robot Buddy for Exploring Web Topics

Filed under: Search Engines,Searching,Social Networks — Patrick Durusau @ 7:34 pm

Thoora is Your Robot Buddy for Exploring Web Topics by Jon Mitchell. (on ReadWriteWeb)

From the post:

With a Web full of stuff, discovery is a hard problem. Search engines were the first tools on the scene, but their rankings still have a hard time identifying relevance the same way a human user would. These days, social networks are the substitute for content discovery, and even the major search engines are using your social signals to determine what’s relevant for you. But the obvious problem with social search is that if your friends haven’t discovered it yet, it’s not on your radar.

At some point, someone in the social graph has to discover something for the first time. With so much new content getting churned out all the time, a Web surfer looking for something original could use some algorithmic help. A new app called Thoora, which launched its public beta last week, uses the power of machine learning to help users uncover new content on topics that interest them.

Create a topic and Thoora suggests keywords; you choose some, can declare them to be equivalent, and the results are shared with others by default.

Users who create “good” topics can develop followings.

Although topics can be shared, the article does not mention sharing keywords.

Seems like a missed opportunity to crowd-source keywords from multiple “good” authors on the same topic to improve the results. That is, you supply five or six keywords for topic A and I come along and suggest some additional keywords for topic A, perhaps from a topic I already have. It would require “acceptance” by the first user but that should not be hard.

I was amused to read in the Thoora FAQ:

Finally, Google News has no social component. Thoora was created so that topics could be shared and followed, because your topics – once painted with your expert brush – are super-valuable to others and ripe for sharing.

Sharing keywords is far more powerful than sharing topics. We have all had the experience of searching for something when a companion suggests a different word and we find the object of our search. Sharing in Thoora now is like following tweets. Useful, but not all that it could be.

If you decide to use Thoora, would appreciate your views and comments.

September 2, 2011

Category-Based Routing in Social Networks:…

Filed under: Identity,Networks,Social Networks — Patrick Durusau @ 7:58 pm

Category-Based Routing in Social Networks: Membership Dimension and the Small-World Phenomenon (Short) by David Eppstein, Michael T. Goodrich, Maarten Löffler, Darren Strash, and Lowell Trott.

Abstract:

A classic experiment by Milgram shows that individuals can route messages along short paths in social networks, given only simple categorical information about recipients (such as “he is a prominent lawyer in Boston” or “she is a Freshman sociology major at Harvard”). That is, these networks have very short paths between pairs of nodes (the so-called small-world phenomenon); moreover, participants are able to route messages along these paths even though each person is only aware of a small part of the network topology. Some sociologists conjecture that participants in such scenarios use a greedy routing strategy in which they forward messages to acquaintances that have more categories in common with the recipient than they do, and similar strategies have recently been proposed for routing messages in dynamic ad-hoc networks of mobile devices. In this paper, we introduce a network property called membership dimension, which characterizes the cognitive load required to maintain relationships between participants and categories in a social network. We show that any connected network has a system of categories that will support greedy routing, but that these categories can be made to have small membership dimension if and only if the underlying network exhibits the small-world phenomenon.

So, if identity is a social construct and the result of small-world networks, then we may need a different kind of precision (different from that of scientific measurement) to identify subjects.

Perhaps the reverse of 20-questions: how many questions do we need for a particular subject? Does anyone remember if there was a common number of questions that was sufficient for the 20-questions game?
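The greedy strategy in the paper is simple to state: hand the message to whichever acquaintance shares the most categories with the recipient. A toy sketch (made-up people and categories; no protection against routing loops):

    # Toy people, categories and acquaintance lists.
    categories = {
        "ann": {"boston", "lawyer"},
        "bob": {"boston", "student"},
        "cat": {"harvard", "student", "sociology"},
        "dee": {"harvard", "sociology", "freshman"},
    }
    neighbours = {"ann": ["bob"], "bob": ["ann", "cat"],
                  "cat": ["bob", "dee"], "dee": ["cat"]}

    def greedy_route(src, dst):
        path, here = [src], src
        while here != dst:
            # Forward to the acquaintance sharing most categories with dst.
            here = max(neighbours[here],
                       key=lambda n: len(categories[n] & categories[dst]))
            path.append(here)
        return path

    print(greedy_route("ann", "dee"))   # ['ann', 'bob', 'cat', 'dee']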

July 23, 2011

Information Propagation in Twitter’s Network

Filed under: Networks,Similarity,Social Networks — Patrick Durusau @ 3:12 pm

Information Propagation in Twitter’s Network

From the post:

It’s well-known that Twitter’s most powerful use is as tool for real-time journalism. Trying to understand its social connections and outstanding capacity to propagate information, we have developed a mathematical model to identify the evolution of a single tweet.

The way a tweet is spread through the network is closely related with Twitter’s retweet functionality, but retweet information is fairly incomplete due to the fight for earning credit/users by means of being the original source/author. We have taken into consideration this behavior and our approach uses text similarity measures as complement of retweet information. In addition, #hashtags and urls are included in the process since they have an important role in Twitter’s information propagation.

Once we designed (and implemented) our mathematical model, we tested it with some Twitter topics we had tracked using a visualization tool (Life of a Tweet). Our conclusions after the experiments were:

  1. Twitter’s real propagation is based on information (tweets’ content) and not on Twitter’s structure (retweet).
  2. Because we can detect Twitter’s real propagation, we can retrieve Twitter’s real networks.
  3. Text similarity scores allow us to select how fuzzy the tweets’ connections are and, by extension, the network’s connections. This means that we can set a minimum threshold to determine when two tweets contain the same concept.

Interesting. Useful for anyone who wants to grab “real” connections and networks to create topics for merging further information about the same.
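If you want to play with the idea, the simplest text similarity measure that could work is Jaccard overlap of token sets with a threshold (my toy stand-in for whatever measure the authors actually used):

    def jaccard(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb)

    t1 = "breaking: earthquake hits the city #quake"
    t2 = "earthquake hits the city, stay safe #quake"
    t3 = "great coffee this morning"

    THRESHOLD = 0.4   # tune to control how fuzzy the connections are
    print(jaccard(t1, t2) >= THRESHOLD)   # True  -> same propagation chain
    print(jaccard(t1, t3) >= THRESHOLD)   # False -> unrelated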

You may want to also look at: Meme Diffusion Through Mass Social Media which is about a $900K NSF project on tracking memes through social media.

Admittedly this is an important area of research, but I would view the results with a great deal of caution. Here’s why:

  1. Memes travel through news outlets, print, radio, TV, websites
  2. Memes travel through social outlets, such as churches, synagogues, mosques, social clubs
  3. Memes travel through business relationships and work places
  4. Memes travel through family gatherings and relationships
  5. Memes travel over cell phone conversations as well as tweets

That some social media is easier to obtain and process than others doesn’t make it a reliable basis for decision making.

July 17, 2011

Social Media in Strategic Communication (SMISC)

Filed under: Funding,Marketing,Social Networks — Patrick Durusau @ 7:25 pm

Social Media in Strategic Communication (SMISC)

From the Synopsis:

DARPA is soliciting innovative research proposals in the area of social media in strategic communication. Proposed research should investigate innovative approaches that enable revolutionary advances in science, devices, or systems. Specifically excluded is research that primarily results in evolutionary improvements to the existing state of practice. See the full DARPA-BAA-11-64 document attached.

Important Dates
Posting Date: see announcement at www.fbo.gov
Proposal Due Date
Initial Closing: August 30, 2011, 12:00 noon (ET)
Final Closing: October 11, 2011, 12:00 noon (ET)
Industry Day: Tuesday, August 2, 2011

Contracting Office Address:
3701 North Fairfax Drive
Arlington, Virginia 22203-1714
Primary Point of Contact.:
Dr. Rand Waltzman
DARPA-BAA-11-64@darpa.mil

From the Funding Opportunity Description:

DARPA is soliciting innovative research proposals in the area of social media in strategic communication. Proposed research should investigate innovative approaches that enable revolutionary advances in science, devices, or systems. Specifically excluded is research that primarily results in evolutionary improvements to the existing state of practice. (emphasis added)

I think topic maps could be part of an approach that is revolutionary, not evolutionary.

I don’t have the infrastructure to field an application but if you do and have need for a wooly-pated consultant on such a project, give me a call.

PS: I first saw this in a tweet from Tim O’Reilly.

July 5, 2011

Meme Diffusion Through Mass Social Media

Filed under: Meme,Social Networks — Patrick Durusau @ 1:38 pm

Meme Diffusion Through Mass Social Media

Abstract:

The project is aimed at modeling the diffusion of information online and empirically discriminating among models of mechanisms driving the spread of memes. We explore why some ideas cause viral explosions while others are quickly forgotten. Our analysis goes beyond the traditional approach of applied epidemic diffusion processes and focuses on cascade size distributions and popularity time series in order to model the agents and processes driving the online diffusion of information, including: users and their topical interests, competition for user attention, and the chronological age of information. Completion of our project will result in a better understanding of information flow and could assist in elucidating the complex mechanisms that underlie a variety of human dynamics and organizations. The analysis will involve studying meme diffusion in large-scale social media by collecting and analyzing massive streams of public micro-blogging data.

The project stands to benefit both the research community and the public significantly. Our data will be made available via APIs and include information on meme propagation networks, statistical data, and relevant user and content features. The open-source platform we develop will be made publicly available and will be extensible to ever more research areas as a greater preponderance of human activities are replicated online. Additionally, we will create a web service open to the public for monitoring trends, bursts, and suspicious memes. This service could mitigate the diffusion of false and misleading ideas, detect hate speech and subversive propaganda, and assist in the preservation of open debate.

NSF grant to date of a little over $900K.

I wonder about a web service to: “… mitigate the diffusion of false and misleading ideas, detect hate speech and subversive propaganda, and assist in the preservation of open debate.”

The definitions of “false and misleading ideas,” as well as “hate speech and subversive propaganda,” vary from community to community.

July 3, 2011

Who’s Your Daddy?

Filed under: Data Source,Dataset,Marketing,Mashups,Social Graphs,Social Networks — Patrick Durusau @ 7:30 pm

Who’s Your Daddy? (Genealogy and Corruption, American Style)

NPR (National Public Radio) News broadcast the opinion this morning that Brits are marginally less corrupt than Americans. Interesting question. Was Bonnie less corrupt than Clyde? Debate at your leisure but the story did prompt me to think of an excellent resource for tracking both U.S. and British style corruption.

It’s probably all the talk of lineage in the news lately, but why not use the genealogy records that are gathered so obsessively to track the soft corruption of influence?

Just another data set to overlay on elected, appointed, and hired positions, lobbyists, disclosure statements, contributions, known sightings, congressional legislation and administrative regulations, etc. Could lead to a “Who’s Your Daddy?” segment on NPR where employment or contracts are questioned naming names. That would be interesting.

It also seems more likely to be effective than the “disclose your corruption” sunlight approach. Corruption is never confessed, it has to be rooted out.

June 28, 2011

Explore the Marvel Universe Social Graph

Filed under: Government Data,Graphs,Social Graphs,Social Networks — Patrick Durusau @ 9:50 am

Explore the Marvel Universe Social Graph

From the post (but be sure to see the images):

From Friday evening to Sunday afternoon, Kai Chang, Tom Turner, and Jefferson Braswell were tuning their visualizations and had a lot of fun exploring the Spiderman or Captain America ego networks. They came up with these beautiful snapshots and created a zoomable web version using the Seadragon plugin. They won the “Most aesthetically pleasing visualization” category, congratulations to Kai, Tom and Jefferson for their amazing work!

The datasets have been added to the wiki Datasets page, so you can play with them and maybe calculate some metrics like centrality on the network. The graph is pretty large, so be sure to increase your Gephi memory settings to > 2GB.

I am sure the Marvel Comic graph is a lot more amusing but I can’t help but wonder about ego networks that combine:

  • Lobbyists registered with the US government
  • Elected and appointed officials and their staffs, plus staff’s families
  • Washington social calendar reports
  • Political donations
  • The Green Book

Topic maps could play a role in layering contracts, legislation and other matters onto the various ego networks.

March 21, 2011

Providing Recommendations in Social Networks Using Python: AtePassar Study Case

Filed under: Python,Social Networks — Patrick Durusau @ 8:55 am

Providing Recommendations in Social Networks Using Python: AtePassar Study Case

From the post:

Recently I’ve been working on recommendations, especially those related to social networks. One of my tasks is to investigate, create and analyze a recommendation engine capable of generating suggestions of friends, study groups, videos and related content to a registered user in a social network.

The social network that I am working on is called AtePassar, a Brazilian social network for people who want to apply for positions in Brazilian civil (government) services. One of the great features of this social network is that people can share their interests about studies and meet people from all around Brazil with the same interests, or someone who will apply for the same exam. Can you imagine the possibilities?

Applications that assist in the authoring of topic maps (to say nothing of recommending topics from topic maps) are going to make “recommendations.”
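For a taste of what such an engine computes, here is a common-neighbors friend suggester in NetworkX (toy data, not AtePassar’s engine):

    import networkx as nx

    # Toy AtePassar-like friendship data.
    G = nx.Graph([("ana", "bia"), ("ana", "caio"), ("bia", "dani"),
                  ("caio", "dani"), ("dani", "edu")])

    def suggest_friends(g, user, k=3):
        # Score every non-friend by the number of friends in common.
        scores = {v: len(set(g[user]) & set(g[v]))
                  for v in g.nodes if v != user and not g.has_edge(user, v)}
        return sorted(scores, key=scores.get, reverse=True)[:k]

    print(suggest_friends(G, "ana"))   # 'dani' first: two friends in common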

February 28, 2011

RDBMS in the Social Networks Age

Filed under: Networks,RDBMS,Social Networks — Patrick Durusau @ 10:03 am

RDBMS in the Social Networks Age by Lorenzo Alberton.

A slide deck that made me wish I had seen the presentation!

Its treatment of graph representation in a relational system is particularly strong.

The bibliography is useful as well.

Just to tempt you into viewing the slide deck, slide 19, The Boring Stuff, is very amusing.

February 14, 2011

NetworkX Introduction: Hacking social networks using the Python programming language

Filed under: Social Networks — Patrick Durusau @ 11:34 am

NetworkX Introduction: Hacking social networks using the Python programming language

Aric Hagberg (Los Alamos National Laboratory) and Drew Conway (New York University) for the 2011 Sunbelt Conference on Social Networks.

I ran across this while looking at the “Data Bootcamp” materials that Drew Conway posted.
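If you have never touched NetworkX, the flavor of it takes about five lines (Zachary’s karate club is the classic social-network test data it ships with):

    import networkx as nx

    G = nx.karate_club_graph()             # classic social-network test data
    btw = nx.betweenness_centrality(G)
    print(sorted(btw, key=btw.get, reverse=True)[:3])
    # the club's two leaders (nodes 0 and 33) rank among the top brokers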

Still reading The Social Life of Information, and remembering that not all topic map folks use Perl ;-), two good reasons to mention this resource.

Seriously, information only exists (in any meaningful sense) in social networks.

It could be that information exists independently of us, but how interesting is that?

Even the arguments about the existence of information in our absence takes place in our presence. How ironic is that?

February 13, 2011

Software for Non-Human Users?

The description of: Emerging Intelligent Data and Web Technologies (EIDWT-2011) is a call for software designed for non-human users.

The Social Life of Information by John Seely Brown and Paul Duguid, makes it clear that human users don’t want to share data because sharing data represents a loss of power/status.

A poll of the readers of CACM or Computer would report a universal experience of working in an office where information is hoarded by individuals in order to increase their own status or power.

9/11 was preceded, and has been followed to this day, by a non-sharing of intelligence data. Even national peril cannot overcome the non-sharing reflex with regard to data.

EIDWT-2011, and conferences like it, are predicated on a sharing of data known not to exist, at least among human users.

Hence, I suspect the call must be directed at software for non-human users.

Emerging Intelligent Data and Web Technologies (EIDWT-2011)

2nd International Conference on Emerging Intelligent Data and Web Technologies (EIDWT-2011)

From the announcement:

The 2-nd International Conference on Emerging Intelligent Data and Web Technologies (EIDWT-2011) is dedicated to the dissemination of original contributions that are related to the theories, practices and concepts of emerging data technologies yet most importantly of their applicability in business and academia towards a collective intelligence approach. In particular, EIDWT-2011 will discuss advances about utilizing and exploiting data generated from emerging data technologies such as Data Centers, Data Grids, Clouds, Crowds, Mashups, Social Networks and/or other Web 2.0 implementations towards a collaborative and collective intelligence approach leading to advancements of virtual organizations and their user communities. This is because, current and future Web and Web 2.0 implementations will store and continuously produce a vast amount of data, which if combined and analyzed through a collective intelligence manner will make a difference in the organizational settings and their user communities. Thus, the scope of EIDWT-2011 is to discuss methods and practices (including P2P) which bring various emerging data technologies together to capture, integrate, analyze, mine, annotate and visualize data – made available from various community users – in a meaningful and collaborative for the organization manner. Finally, EIDWT-2011 aims to provide a forum for original discussion and prompt future directions in the area.

Important Dates:

Submission Deadline: March 10, 2011
Authors Notification: May 10, 2011
Author Registration: June 10, 2011
Final Manuscript: July 1, 2011
Conference Dates: September 7 – 9, 2011

January 24, 2011

Visualizing Social Networks

Filed under: Social Networks,Visualization — Patrick Durusau @ 6:21 am

Visualizing Social Networks

A goldmine of resources on visualizing social networks!

Important for topic maps because if you think about it, all subjects exist in some social context, that is to say a social network.

Visualization can assist in exploring what parts of a social network have or have not been represented in a topic map.

This is a resource that I will be exploring over time.

January 10, 2011

NoSQL Tapes

Filed under: Cassandra,CouchDB,Graphs,MongoDB,Neo4j,Networks,NoSQL,OrientDB,Social Networks — Patrick Durusau @ 1:33 pm

NoSQL Tapes: A filmed compilation of interviews, explanations & case studies

From the email announcement by Tim Anglade:

Late last year, as the NOSQL Summer drew to a close, I got the itch to start another NOSQL community project. So, with the help of vendors Scality and InfiniteGraph, I toured around the world for 77 days to meet and record video interviews with 40+ NOSQL vendors, users and dudes-you-can-trust.

….

My original goals were to attempt to map a comprehensive view of the NOSQL world, its origins, its current trends and potential future. NOSQL knowledge seemed to me to be heavily fragmented and hard to reconcile across projects, vendors & opinions. I wanted to try to foster more sharing in our community and figure out what people thought ‘NOSQL’ meant. As it happens, I ended up learning quite a lot in the process (as I’m sure even seasoned NOSQLers on this list will too).

I’d like to take this opportunity to thank everybody who agreed to participate in this series: 10gen, Basho, Cloudant, CouchOne, FourSquare, Ben Black, RethinkDB, MarkLogic, Cloudera, SimpleGeo, LinkedIn, Membase, Ryan Rawson, Cliff Moon, Gemini Mobile, Furuhashi-san, Luca Garulli, Sergio Bossa, Mathias Meyer, Wooga, Neo4J, Acunu (and a few other special guests I’m keeping under wraps for now); I couldn’t have done it without them and learned by leaps & bounds for every hour I spent with each of them.

I’d also like to thank my two sponsors, Scality & InfiniteGraph, from the bottom of my heart. They were supportive in a way I didn’t think companies could be and gave me total control of the shape & content of the project. I’d encourage you to check them out if you haven’t done so already.

As always, I’ll be glad to take any comments or suggestions you may have either by email (tim@nosqltapes.com) or on Twitter (@timanglade).

Simply awesome!

January 9, 2011

Center for Computational Analysis of Social and Organizational Systems (CASOS)

Center for Computational Analysis of Social and Organizational Systems (CASOS)

Home of both ORA and AutoMap but I thought it merited an entry of its own.

Directed by Dr. Kathleen Carley:

CASOS brings together computer science, dynamic network analysis and the empirical study of complex socio-technical systems. Computational and social network techniques are combined to develop a better understanding of the fundamental principles of organizing, coordinating, managing and destabilizing systems of intelligent adaptive agents (human and artificial) engaged in real tasks at the team, organizational or social level. Whether the research involves the development of metrics, theories, computer simulations, toolkits, or new data analysis techniques advances in computer science are combined with a deep understanding of the underlying cognitive, social, political, business and policy issues.

CASOS is a university wide center drawing on a group of world class faculty, students and research and administrative staff in multiple departments at Carnegie Mellon. CASOS fosters multi-disciplinary research in which students and faculty work with students and faculty in other universities as well as scientists and practitioners in industry and government. CASOS research leads the way in examining network dynamics and in linking social networks to other types of networks such as knowledge networks. This work has led to the development of new statistical toolkits for the collection and analysis of network data (Ora and AutoMap). Additionally, a number of validated multi-agent network models in areas as diverse as network evolution, bio-terrorism, covert networks, and organizational adaptation have been developed and used to increase our understanding of real socio-technical systems.

CASOS research spans multiple disciplines and technologies. Social networks, dynamic networks, agent based models, complex systems, link analysis, entity extraction, link extraction, anomaly detection, and machine learning are among the methodologies used by members of CASOS to tackle real world problems.

Definitely a group that bears watching by anyone interested in topic maps!
