Community detection in networks: structural clusters versus ground truth by Darko Hric, Richard K. Darst, and Santo Fortunato.
Abstract:
Algorithms to find communities in networks rely just on structural information and search for cohesive subsets of nodes. On the other hand, most scholars implicitly or explicitly assume that structural communities represent groups of nodes with similar (non-topological) properties or functions. This hypothesis could not be verified, so far, because of the lack of network datasets with ground truth information on the classification of the nodes. We show that traditional community detection methods fail to find the ground truth clusters in many large networks. Our results show that there is a marked separation between structural and annotated clusters, in line with recent findings. That means that either our current modeling of community structure has to be substantially modified, or that annotated clusters may not be recoverable from topology alone.
Deeply interesting work if you are trying to detect “subjects” by clustering nodes in a network.
I would heed the warning that topology may not accurately represent hidden information.
Beyond this particular case, I would test any assumption that some known factor represents an unknown factor(s) for any data set. Better that the results surprise you than your client.
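One way to run that kind of test is to score how well the clusters you detected agree with whatever annotations you have. The sketch below uses normalized mutual information (NMI), a common agreement score for two labelings of the same nodes. It is my illustration, not necessarily the measure the authors used, and the function name and the toy label lists are hypothetical.

```python
from collections import Counter
from math import log


def normalized_mutual_information(labels_a, labels_b):
    """Score agreement between two labelings of the same nodes.

    Returns a value between 0 (the labelings are independent) and
    1 (they describe the same partition, up to renaming the labels).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same nodes")
    if not labels_a:
        raise ValueError("labelings must not be empty")

    n = len(labels_a)
    count_a = Counter(labels_a)                 # cluster sizes in labeling A
    count_b = Counter(labels_b)                 # cluster sizes in labeling B
    count_ab = Counter(zip(labels_a, labels_b))  # sizes of the overlaps

    # Mutual information between the two labelings.
    mutual_info = sum(
        (n_ab / n) * log(n * n_ab / (count_a[a] * count_b[b]))
        for (a, b), n_ab in count_ab.items()
    )

    # Entropy of each labeling on its own.
    entropy_a = -sum((c / n) * log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * log(c / n) for c in count_b.values())

    if entropy_a + entropy_b == 0:
        # Both labelings put every node into one cluster: trivially identical.
        return 1.0
    return 2 * mutual_info / (entropy_a + entropy_b)


# Hypothetical toy data: one entry per node, in the same node order.
detected = [0, 0, 0, 1, 1, 1]              # structural clusters (assumed)
annotated = ["x", "x", "y", "y", "y", "y"]  # "ground truth" labels (assumed)

print(normalized_mutual_information(detected, annotated))
```

A score near 1 says the structural clusters track the annotations; a score near 0 says the known factor tells you little about the unknown one, which is exactly the surprise you want to find before your client does.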
I first saw this in a tweet by Brian Keegan.
PS: As you already know, “ground truth” depends upon your point of view. Don’t risk your work on the basis of someone else’s “ground truth.”
[…] deeply interesting work on the formal characteristics of LOD datasets but as we learned in Community detection in networks:… a relationship between a typology (another formal characteristic) and some hidden fact(s) may or […]
Pingback by A Methodology for Empirical Analysis of LOD Datasets « Another Word For It — June 6, 2014 @ 6:52 pm