Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 25, 2011

Subject Recognition Measure?

Filed under: Similarity — Patrick Durusau @ 6:43 pm

I ran across the following passage this weekend:

Speed of Processing: Reaction Time. The speed with which subjects can judge statements about category membership is one of the most widely used measures of processing in semantic memory research within the human information-processing framework. Subjects typically are required to respond true or false to statements of the form: X item is a member of Y category, where the dependent variable of interest is reaction time. In such tasks, for natural language categories, responses of true are invariably faster for the items that have been rated more prototypical.

Principles of Categorization by Eleanor Rosch, in Cognition and Categorization, edited by Eleanor Rosch and Barbara Lloyd, Lawrence Erlbaum Associates, Publishers, Hillsdale, New Jersey, 1978.

This could be part of a topic map authoring UI that asks users to recognize and place subjects into categories. The faster a user responds, the greater the confidence in their answer.
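A minimal sketch of how such a UI measure might work. The mapping from reaction time to confidence, the baseline, and the decay constant are all my own illustrative assumptions, not anything from Rosch:

```python
import math

def confidence_from_reaction_time(rt_seconds, baseline=0.5, decay=0.6):
    """Map a reaction time to a [0, 1] confidence score.

    Hypothetical mapping: responses at or below the baseline latency get
    full confidence; slower responses decay exponentially toward zero.
    """
    if rt_seconds <= baseline:
        return 1.0
    return math.exp(-decay * (rt_seconds - baseline))

# A fast "true" judgment counts for more than a slow, hesitant one.
fast = confidence_from_reaction_time(0.4)
slow = confidence_from_reaction_time(3.0)
```

The decay curve itself would need calibrating against actual user data, of course; the point is only that latency is cheap to capture during authoring.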

I borrowed the book where the essay appears to read Amos Tversky’s challenge to the geometric approach to similarity. More on that later this week.

July 23, 2011

Information Propagation in Twitter’s Network

Filed under: Networks,Similarity,Social Networks — Patrick Durusau @ 3:12 pm

Information Propagation in Twitter’s Network

From the post:

It’s well-known that Twitter’s most powerful use is as a tool for real-time journalism. Trying to understand its social connections and outstanding capacity to propagate information, we have developed a mathematical model to identify the evolution of a single tweet.

The way a tweet is spread through the network is closely related to Twitter’s retweet functionality, but retweet information is fairly incomplete due to the fight for earning credit/users by means of being the original source/author. We have taken this behavior into consideration, and our approach uses text similarity measures as a complement to retweet information. In addition, #hashtags and urls are included in the process since they have an important role in Twitter’s information propagation.

Once we designed (and implemented) our mathematical model, we tested it with some Twitter topics we had tracked, using a visualization tool (Life of a Tweet). Our conclusions after the experiments were:

  1. Twitter’s real propagation is based on information (tweets’ content) and not on Twitter’s structure (retweet).
  2. Because we can detect Twitter’s real propagation, we can retrieve Twitter’s real networks.
  3. Text similarity scores allow us to select how fuzzy the tweets’ connections are and, by extension, the network’s connections. This means that we can set a minimum threshold to determine when two tweets contain the same concept.

Interesting. Useful for anyone who wants to grab “real” connections and networks to create topics for merging further information about the same subjects.
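The text-similarity step in the post can be sketched with a token-level Jaccard score and a tunable threshold. The tokenizer and the threshold value here are my own illustrative choices, not the authors’:

```python
def tokens(tweet):
    """Lowercase, whitespace-split tokenizer; keeps #hashtags and urls intact."""
    return set(tweet.lower().split())

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def same_concept(t1, t2, threshold=0.5):
    """Treat two tweets as carrying the same concept above the threshold."""
    return jaccard(tokens(t1), tokens(t2)) >= threshold

original = "breaking: earthquake hits chile #chile http://ex.am/1"
copy     = "earthquake hits chile #chile http://ex.am/1 via @news"
```

Raising or lowering the threshold is exactly the “how fuzzy” knob the post describes: a high threshold links only near-verbatim copies, a low one links loose paraphrases.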

You may also want to look at Meme Diffusion Through Mass Social Media, which describes a $900K NSF project on tracking memes through social media.

Admittedly an important area of research, but I would view the results with a great deal of caution. Here’s why:

  1. Memes travel through news outlets, print, radio, TV, websites
  2. Memes travel through social outlets, such as churches, synagogues, mosques, social clubs
  3. Memes travel through business relationships and work places
  4. Memes travel through family gatherings and relationships
  5. Memes travel over cell phone conversations as well as tweets

That some social media is easier to obtain and process than others doesn’t make it a reliable basis for decision making.

January 25, 2011

NAQ Tree in Your Forest?

Effectiveness of NAQ-tree as index structure for similarity search in high-dimensional metric space Authors: Ming Zhang and Reda Alhajj Keywords: Knn search, High dimensionality, Dimensionality reduction, Indexing, Similarity search

Abstract:

Similarity search (e.g., k-nearest neighbor search) in high-dimensional metric space is the key operation in many applications, such as multimedia databases, image retrieval and object recognition, among others. The high dimensionality and the huge size of the data set require an index structure to facilitate the search. State-of-the-art index structures are built by partitioning the data set based on distances to certain reference point(s). Using the index, search is confined to a small number of partitions. However, these methods either ignore the property of the data distribution (e.g., VP-tree and its variants) or produce non-disjoint partitions (e.g., M-tree and its variants, DBM-tree); these greatly affect the search efficiency. In this paper, we study the effectiveness of a new index structure, called Nested-Approximate-eQuivalence-class tree (NAQ-tree), which overcomes the above disadvantages. NAQ-tree is constructed by recursively dividing the data set into nested approximate equivalence classes. The conducted analysis and the reported comparative test results demonstrate the effectiveness of NAQ-tree in significantly improving the search efficiency.

I think the following paragraph from the paper is even more interesting:

Consider a set of objects O = {o1 , o2 , . . . , on } and a set of attributes A = {a1 , a2 , . . . , ad }, we first divide the objects into groups based on the first attribute a1 , i.e., objects with same value of a1 are put in the same group; each group is an equivalence class [23] with respect to a1 . In other words, all objects in a group are indistinguishable by attribute a1 . We can refine the equivalence classes further by dividing each existing equivalence class into groups based on the second attribute a2 ; all objects in a refined equivalence class are indistinguishable by attributes a1 and a2 . This process may be repeated by adding one more attribute at a time until all the attributes are considered. Finally, we get a hierarchical set of equivalence classes, i.e., a hierarchical partitioning of the objects. This is roughly the basic idea of NAQ-tree, i.e., to partition the data space in our similarity search method. In other words, given a query object o, we can gradually reduce the search space by gradually considering the most relevant attributes.

With the caveat that this technique is focused on metric spaces.

But I rather like the idea of reducing the search space by the attributes under consideration. Replace search space with similarity/sameness space and you will see what I mean. Still relevant for searching as well.
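The nested refinement the authors describe can be sketched directly. This is a toy illustration of the partitioning idea only, not the NAQ-tree index itself:

```python
def partition(objects, attributes):
    """Recursively partition objects into nested equivalence classes.

    objects:    list of dicts mapping attribute name -> value
    attributes: attribute names, considered one at a time
    Returns a nested dict keyed by attribute values; leaves are lists of
    objects indistinguishable by all the attributes considered so far.
    """
    if not attributes:
        return objects
    first, rest = attributes[0], attributes[1:]
    groups = {}
    for obj in objects:
        groups.setdefault(obj[first], []).append(obj)
    # Refine each equivalence class by the remaining attributes.
    return {value: partition(members, rest) for value, members in groups.items()}

data = [
    {"a1": "x", "a2": 1},
    {"a1": "x", "a2": 2},
    {"a1": "y", "a2": 1},
]
tree = partition(data, ["a1", "a2"])
# tree["x"][1] holds the objects indistinguishable by a1 and a2.
```

Ordering the attributes from most to least discriminating is what lets a query prune most of the space after looking at only a few of them.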

January 20, 2011

Record Linkage: Similarity Measures and Algorithms

Filed under: Record Linkage,Similarity — Patrick Durusau @ 11:30 am

Record Linkage: Similarity Measures and Algorithms Authors Nick Koudas, Sunita Sarawagi, Divesh Srivastava

A little dated (2006) but still a very useful review of similarity measures under the rubric of record linkage.

January 13, 2011

Scaling Jaccard Distance for Document Deduplication: Shingling, MinHash and Locality-Sensitive Hashing – Post

Filed under: Data Mining,Similarity — Patrick Durusau @ 5:42 am

Scaling Jaccard Distance for Document Deduplication: Shingling, MinHash and Locality-Sensitive Hashing

Bob Carpenter of Ling-Pipe Blog points out the treatment of Jaccard distance in Mining Massive Datasets by Anand Rajaraman and Jeffrey D. Ullman.

Worth a close look.
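The chapter’s pipeline can be shown in miniature: k-character shingling plus exact Jaccard similarity. MinHash then approximates this same score without materializing the full shingle sets, and locality-sensitive hashing avoids comparing all pairs. The shingle size here is my own choice:

```python
def shingles(text, k=4):
    """Set of all k-character shingles of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"
score = jaccard(shingles(doc1), shingles(doc2))
# Near-duplicates score high; unrelated strings score near zero.
```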

January 3, 2011

Processing Tweets with LingPipe #3: Near duplicate detection and evaluation – Post

Filed under: Duplicates,Natural Language Processing,Similarity,String Matching — Patrick Durusau @ 3:03 pm

Processing Tweets with LingPipe #3: Near duplicate detection and evaluation

Good coverage of tokenization of tweets and the use of the Jaccard Distance measure to determine similarity.

Of course, for a topic map, similarity may not lead to being discarded but trigger other operations instead.

December 26, 2010

XML Schema Element Similarity Measures: A Schema Matching Context

Filed under: Similarity,Subject Identity,Topic Maps — Patrick Durusau @ 4:33 pm

XML Schema Element Similarity Measures: A Schema Matching Context Authors: Alsayed Algergawy, Richi Nayak, Gunter Saake

Abstract:

In this paper, we classify, review, and experimentally compare major methods that are exploited in the definition, adoption, and utilization of element similarity measures in the context of XML schema matching. We aim at presenting a unified view which is useful when developing a new element similarity measure, when implementing an XML schema matching component, when using an XML schema matching system, and when comparing XML schema matching systems.

I commend the entire paper for your reading but would draw your attention to one of the conclusions in particular:

Using a single element similarity measure is not sufficient to assess the similarity between XML schema elements. This necessitates the need to utilize several element measures exploiting both internal element features and external element relationships.

Does it seem plausible that a single subject similarity measure can work, but that it is better to use several?

Questions:

  1. Compare this paper to any recent (last two years) paper on database schema similarity. What issues are the same, different, similar? (sorry, could not think of another word for it) (2-3 pages, citations)
  2. Create an annotated bibliography of ten (10) recent papers on XML or database schema similarity (excluding the papers in #1). (4-6 pages, citations)
  3. How would you use any of the similarity measures you have read about in a topic map? Or is similarity enough? (3-5 pages, no citations)

December 18, 2010

Self-organization in Distributed Semantic Repositories – Presentation

Filed under: Clustering,Self-organization,Similarity — Patrick Durusau @ 6:19 am

Kia Teymourian Video, Slides from SOKS: Self-Organising Knowledge Systems, Amsterdam, 29 April 2010

Abstract:

Principles from nature-inspired selforganization can help to attack the massive scalability challenges in future internet infrastructures. We researched into ant-like mechanisms for clustering semantic information. We outline algorithms to store related information within clusters to facilitate efficient and scalable retrieval.

At the core are similarity measures that cannot consider global information such as a completely shared ontology. Mechanisms for syntax-based URI-similarity and the usage of a dynamic partial view on an ontology for path-length based similarity are described and evaluated. We give an outlook on how to consider application specific relations for clustering with a usecase in geo-information systems.

Research questions:

  1. What about a similarity function where “sim = 1.0?”
  2. What about ants with different similarity functions?
  3. The similarity measure is RDF bound. What other similarity measures are in use?

Observation: The Wordnet Ontology is used for the evaluation. It occurred to me that Wordnet gets used a lot, but never reused. Or rather, the results of using Wordnet are never reused.

Isn’t it odd that we keep reasoning about sparrows being like ducks, over and over again? Seems like we should be able to take the results of others and build upon them. What prevents that from happening? Either in searching or ontology systems.

November 18, 2010

The Positive Matching Index: A new similarity measure with optimal characteristics

Filed under: Binary Distance,Similarity,Subject Identity — Patrick Durusau @ 7:57 am

The Positive Matching Index: A new similarity measure with optimal characteristics Authors: Daniel Andrés Dos Santos, Reena Deutsch Keywords: Binary data, Association coefficient, Jaccard index, Dice index, Similarity

Abstract:

Despite the many coefficients accounting for the resemblance between pairs of objects based on presence/absence data, no one measure shows optimal characteristics. In this work the Positive Matching Index (PMI) is proposed as a new measure of similarity between lists of attributes. PMI fulfills the Tulloss’ theoretical prerequisites for similarity coefficients, is easy to calculate and has an intrinsic meaning expressable into a natural language. PMI is bounded between 0 and 1 and represents the mean proportion of positive matches relative to the size of attribute lists, ranging this cardinality continuously from the smaller list to the larger one. PMI behaves correctly where alternative indices either fail, or only approximate to the desirable properties for a similarity index. Empirical examples associated to biomedical research are provided to show out performance of PMI in relation to standard indices such as Jaccard and Dice coefficients.

An index for people who don’t think a single measure for identity (URIs) is enough, say those in the natural sciences?
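For comparison, the two standard indices the paper benchmarks against are easy to state over presence/absence data. PMI itself has a more involved definition, for which see the paper; the attribute lists below are invented for illustration:

```python
def jaccard(a, b):
    """Jaccard index: positive matches over the union of attributes."""
    return len(a & b) / len(a | b)

def dice(a, b):
    """Dice index: gives positive matches double weight."""
    return 2 * len(a & b) / (len(a) + len(b))

species_a = {"webbed_feet", "bill", "feathers"}
species_b = {"bill", "feathers", "talons", "crest"}
# Dice is always >= Jaccard for the same pair of attribute sets.
```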

November 17, 2010

Normalized Kernels as Similarity Indices (and algorithm bias)

Filed under: Clustering,Kernel Methods,Similarity — Patrick Durusau @ 8:20 am

Normalized Kernels as Similarity Indices Author(s): Julien Ah-Pine Keywords: Kernel normalization, similarity indices, kernel PCA based clustering

Abstract:

Measuring similarity between objects is a fundamental issue for numerous applications in data-mining and machine learning domains. In this paper, we are interested in kernels. We particularly focus on kernel normalization methods that aim at designing proximity measures that better fit the definition and the intuition of a similarity index. To this end, we introduce a new family of normalization techniques which extends the cosine normalization. Our approach aims at refining the cosine measure between vectors in the feature space by considering another geometrical based score which is the mapped vectors’ norm ratio. We show that the designed normalized kernels satisfy the basic axioms of a similarity index unlike most unnormalized kernels. Furthermore, we prove that the proposed normalized kernels are also kernels. Finally, we assess these different similarity measures in the context of clustering tasks by using a kernel PCA based clustering approach. Our experiments employing several real-world datasets show the potential benefits of normalized kernels over the cosine normalization and the Gaussian RBF kernel.

Points out that some methods don’t result in an object being found to be most similar to…itself. What an odd result.

Moreover, it is possible for vectors that represent different scores to be treated as identical.
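Cosine normalization, the baseline the paper extends, repairs exactly that first defect: after normalizing, every object’s similarity to itself is 1. A sketch, using an unnormalized polynomial kernel as the starting point (my example, not the paper’s):

```python
import math

def poly_kernel(x, y, degree=2):
    """Unnormalized polynomial kernel; k(x, x) varies with the norm of x."""
    return sum(a * b for a, b in zip(x, y)) ** degree

def cosine_normalized(k, x, y):
    """k(x, y) / sqrt(k(x, x) * k(y, y)) -- a bounded similarity index."""
    return k(x, y) / math.sqrt(k(x, x) * k(y, y))

x, y = [1.0, 2.0], [3.0, 1.0]
# Unnormalized, "self-similarity" k(x, x) differs from object to object.
# Normalized, both self-similarities are exactly 1, as a similarity
# index's axioms require.
```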

Questions:

  1. What axioms of similarity indexes should we take notice of? (3-5 pages, citations)
  2. What methods treat vectors with different scores as identical? (3-5 pages, citations)
  3. Are geometric based similarity indices measuring semantic or geometric similarity? Are those the same concepts or different concepts? (10-15 pages, citations, you can make this a final paper if you like.)

November 15, 2010

Analysis of Amphibian Biodiversity Data

Filed under: Authoring Topic Maps,Bioinformatics,Similarity — Patrick Durusau @ 3:14 pm

Analysis of Amphibian Biodiversity Data.

Traditional citation: Hayek, L.-A. C. 1994. Analysis of amphibian biodiversity data. Pp. 207-269. In: Measuring and monitoring biological diversity. Standard methods for amphibians. W. R. Heyer et al., eds. (Smithsonian Institution, Washington, D. C.).

Important for two reasons:

  1. it gathers together forty-six (46) similarity measures (yes, 46 of them)
  2. illustrates that reading broadly is useful in topic maps work

Questions:

  1. From Hayek, which measures would you want to use building your topic map? Why? (3-5 pages, no citations)
  2. What measures developed after Hayek would you want to use? (specific to your data) (3-5 pages, citations)
  3. Just curious, we talk about algorithms “measuring” similarity. Pick two things, books, articles, whatever that you think are “similar.” Would any of these algorithms say they were similar? (3-5 pages, no citations. Yes, it is a hard question.)

Towards Index-based Similarity Search for Protein Structure Databases

Filed under: Bioinformatics,Biomedical,Indexing,Similarity — Patrick Durusau @ 5:00 am

Towards Index-based Similarity Search for Protein Structure Databases Authors: Orhan Çamoǧlu, Tamer Kahveci, Ambuj K. Singh Keywords: Protein structures, feature vectors, indexing, dataset join

Abstract:

We propose two methods for finding similarities in protein structure databases. Our techniques extract feature vectors on triplets of SSEs (Secondary Structure Elements) of proteins. These feature vectors are then indexed using a multidimensional index structure. Our first technique considers the problem of finding proteins similar to a given query protein in a protein dataset. This technique quickly finds promising proteins using the index structure. These proteins are then aligned to the query protein using a popular pairwise alignment tool such as VAST. We also develop a novel statistical model to estimate the goodness of a match using the SSEs. Our second technique considers the problem of joining two protein datasets to find an all-to-all similarity. Experimental results show that our techniques improve the pruning time of VAST 3 to 3.5 times while keeping the sensitivity similar.

Unless you want to do a project on bioinformatics indexing and topic maps, this paper probably isn’t of much interest.

I include it as an illustration of fashioning a domain-specific index and, for those who are interested, of what subjects and their definitions lurk therein.

Questions (for those who want to pursue both topic maps and bioinformatics):

  1. Isolate all the “we chose” aspects of the paper. What results would have been different with other choices? The “we obtained best results…” is unsatisfying. In what sense “best results?”
  2. What aspects of this process would be amenable to use of a topic map?
  3. What about the results (if anything) would have to be different to make these results meaningful in a topic map to be merged with results by other researchers?

October 17, 2010

Finding What You Want

Filed under: Data Mining,Music Retrieval,Semantics,Similarity — Patrick Durusau @ 5:00 am

The Known World, a column/blog by David Alan Grier, appears both online and in Computer, a publication of the IEEE Computer Society. Finding What You Want appears in the September, 2010 issue of Computer.

Grier explores how Pandora augments our abilities to explore the vastness of musical space. For years, music retrieval systems had static categories imposed upon them, and those work for some purposes. But they also impose retrieval requirements on users.

According to Grier, the “Great Napster Crisis of 1999-2001” resulted in a new field of music retrieval systems because the existing categories did not quite fit.

I find Grier’s analysis interesting because of his suggestion that the methods by which we find information of interest can shape what we consider as fitting our search criteria.

Perhaps, just perhaps, identifying subjects isn’t quite the string-matching, cut-and-dried approach that is commonly assumed. Music retrieval systems may be a fruitful area to look for clues as to how to improve more traditional information systems.

Questions:

  1. Review Music Retrieval: A Tutorial and Review. (Somewhat dated, can you suggest a replacement?)
  2. Pick two or three techniques used for retrieval of music. How would you adapt those for texts?
  3. How would you test your adapted techniques against a text collection?

October 11, 2010

Finding Itemset-Sharing Patterns in a Large Itemset-Associated Graph

Filed under: Data Mining,Graphs,Similarity,Subject Identity — Patrick Durusau @ 6:37 am

Finding Itemset-Sharing Patterns in a Large Itemset-Associated Graph Authors: Mutsumi Fukuzaki, Mio Seki, Hisashi Kashima, Jun Sese

Abstract:

Itemset mining and graph mining have attracted considerable attention in the field of data mining, since they have many important applications in various areas such as biology, marketing, and social network analysis. However, most existing studies focus only on either itemset mining or graph mining, and only a few studies have addressed a combination of both. In this paper, we introduce a new problem which we call itemset-sharing subgraph (ISS) set enumeration, where the task is to find sets of subgraphs with common itemsets in a large graph in which each vertex has an associated itemset. The problem has various interesting potential applications such as in side-effect analysis in drug discovery and the analysis of the influence of word-of-mouth communication in marketing in social networks. We propose an efficient algorithm ROBIN for finding ISS sets in such graph; this algorithm enumerates connected subgraphs having common itemsets and finds their combinations. Experiments using a synthetic network verify that our method can efficiently process networks with more than one million edges. Experiments using a real biological network show that our algorithm can find biologically interesting patterns. We also apply ROBIN to a citation network and find successful collaborative research works.

If you think of a set of properties, an “itemset,” as a topic and an “itemset-sharing subgraph (ISS)” as a match/merging criterion, the relevance of this paper to topic maps becomes immediately obvious.

Useful for both discovery of topics in data sets as well as part processing a topic map.
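The analogy can be made concrete: merge graph vertices whenever their itemsets share enough common items. The threshold and the sample data below are illustrative assumptions, not anything from the ROBIN paper:

```python
def shared_itemset(items_a, items_b, min_common=2):
    """Return the common itemset if it is large enough to justify merging."""
    common = items_a & items_b
    return common if len(common) >= min_common else None

# Vertices of a graph, each carrying an associated itemset of properties.
topics = {
    "t1": {"author:lee", "year:2009", "venue:kdd"},
    "t2": {"author:lee", "year:2009", "venue:icml"},
    "t3": {"author:kim", "year:2011", "venue:kdd"},
}
merge_t1_t2 = shared_itemset(topics["t1"], topics["t2"])  # 2 common items
merge_t1_t3 = shared_itemset(topics["t1"], topics["t3"])  # only 1, no merge
```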

October 1, 2010

Tell me more, not just “more of the same”

Tell me more, not just “more of the same” Authors: Francisco Iacobelli, Larry Birnbaum, Kristian J. Hammond Keywords: dimensions of similarity, information retrieval, new information detection

Abstract:

The Web makes it possible for news readers to learn more about virtually any story that interests them. Media outlets and search engines typically augment their information with links to similar stories. It is up to the user to determine what new information is added by them, if any. In this paper we present Tell Me More, a system that performs this task automatically: given a seed news story, it mines the web for similar stories reported by different sources and selects snippets of text from those stories which offer new information beyond the seed story. New content may be classified as supplying: additional quotes, additional actors, additional figures and additional information depending on the criteria used to select it. In this paper we describe how the system identifies new and informative content with respect to a news story. We also show that providing an explicit categorization of new information is more useful than a binary classification (new/not-new). Lastly, we show encouraging results from a preliminary evaluation of the system that validates our approach and encourages further study.

If you are interested in the automatic extraction, classification and delivery of information, this article is for you.

I think there are (at least) two interesting ways for “Tell Me More” to develop:

First, persisting entity recognition results with other data (such as story, author, date, etc.) in the form of associations (with appropriate roles, etc.).

Second, and perhaps more importantly, to enable users to add/correct information presented as part of a mapping of information about particular entities.

SocialSearchBrowser: A novel mobile search and information discovery tool

SocialSearchBrowser: A novel mobile search and information discovery tool Authors: Karen Church, Joachim Neumann, Mauro Cherubini and Nuria Oliver Keywords: Mobile search, social search, social networks, location-based services, context, field study, user evaluation

Abstract:

The mobile Internet offers anytime, anywhere access to a wealth of information to billions of users across the globe. However, the mobile Internet represents a challenging information access platform due to the inherent limitations of mobile environments, limitations that go beyond simple screen size and network issues. Mobile users often have information needs which are impacted by contexts such as location and time. Furthermore, human beings are social creatures that often seek out new strategies for sharing knowledge and information in mobile settings. To investigate the social aspect of mobile search, we have developed SocialSearchBrowser (SSB), a novel proof-of-concept interface that incorporates social networking capabilities with key mobile contexts to improve the search and information discovery experience of mobile users. In this paper, we present the results of an exploratory field study of SSB and outline key implications for the design of next generation mobile information access services.

Interesting combination of traditional “ask a search engine” with even more traditional “ask your friends” results. The sample is too small to say what issues might be encountered with wider use, but it is definitely a step in an interesting direction.

September 20, 2010

Similarity Indexing: Algorithms and Performance (1996)

Filed under: High Dimensionality,R-Trees,Similarity — Patrick Durusau @ 3:28 pm

Similarity Indexing: Algorithms and Performance (1996) Authors: David A. White, Ramesh Jain Keywords: Similarity Indexing, High Dimensional Feature Vectors, Approximate k-Nearest Neighbor Searching, Closest Points, Content-Based Retrieval, Image and Video Database Retrieval.

The authors of this paper coined the phrase “similarity indexing.”

A bit dated now but interesting as a background to techniques currently in use.

A topic map tracing the development of one of the “similarity” techniques would make an excellent thesis project.

September 18, 2010

The TV-tree — an index structure for high-dimensional data (1994)

Filed under: Feature Spaces,High Dimensionality,R-Trees,Similarity,Spatial Index — Patrick Durusau @ 8:05 am

The TV-tree — an index structure for high-dimensional data (1994) Authors: King-ip Lin, H. V. Jagadish, Christos Faloutsos Keywords: Spatial Index, Similarity Retrieval, Query by Context, R*-Tree, High-Dimensionality Feature Spaces.

Abstract:

We propose a file structure to index high-dimensionality data, typically, points in some feature space. The idea is to use only a few of the features, utilizing additional features whenever the additional discriminatory power is absolutely necessary. We present in detail the design of our tree structure and the associated algorithms that handle such `varying length’ feature vectors. Finally we report simulation results, comparing the proposed structure with the R*-tree, which is one of the most successful methods for low-dimensionality spaces. The results illustrate the superiority of our method, with up to 80% savings in disk accesses.

The notion of “…utilizing additional features whenever the additional discriminatory power is absolutely necessary…” is an important one.

Compare to fixed simplistic discrimination and/or fixed complex, high-overhead, discrimination between subject representatives.

Either one represents a failure of imagination.
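The “only as many features as necessary” idea can be sketched outside of any tree structure: discriminate between candidates one feature at a time, stopping as soon as only one survives. This illustrates the principle, not the TV-tree algorithm itself, and the records are invented:

```python
def discriminate(query, candidates, features):
    """Narrow candidates feature by feature, using more only when needed."""
    survivors = list(candidates)
    used = []
    for f in features:
        if len(survivors) <= 1:
            break  # enough discriminatory power already
        used.append(f)
        survivors = [c for c in survivors if c[f] == query[f]]
    return survivors, used

records = [
    {"genus": "anas", "species": "platyrhynchos", "region": "eu"},
    {"genus": "anas", "species": "crecca", "region": "eu"},
    {"genus": "passer", "species": "domesticus", "region": "eu"},
]
q = {"genus": "anas", "species": "crecca", "region": "na"}
found, used = discriminate(q, records, ["genus", "species", "region"])
# Two features sufficed; "region" was never consulted -- which also means
# the mismatch on it never got the chance to discard a good match.
```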

September 17, 2010

New Approach for Automated Categorizing and Finding Similarities in Online Persian News

Filed under: Clustering,Similarity — Patrick Durusau @ 4:39 am

New Approach for Automated Categorizing and Finding Similarities in Online Persian News Authors: Naser Ezzati Jivan, Mahlagha Fazeli and Khadije Sadat Yousefi Keywords: Categorization of web pages – category – automatic categorization of Persian news – feature – similarity – clustering – structure of web pages.

Abstract:

The Web is a great source of information where data are stored in different formats, e.g., web-pages, archive files and images. Algorithms and tools which automatically categorize web-pages have wide applications in real-life situations. A web-site which collects news from different sources can be an example of such situations. In this paper, an algorithm for categorizing news is proposed. The proposed approach is specialized to work with documents (news) written in the Persian language but it can be easily generalized to work with documents in other languages, too. There is no standard test-bench or measure to evaluate the performance of this kind of algorithms as the amount of similarity between two documents (news) is not well-defined. To test the performance of the proposed algorithm, we implemented a web-site which uses the proposed approach to find similar news. Some of the similar news items found by the algorithm has been reported.

Similarity: The first step towards subject identification.

September 14, 2010

International Journal of Approximate Reasoning – Volume 51, Issue 8, October 2010

Filed under: Data Mining,Similarity,Subject Identity — Patrick Durusau @ 3:49 am

International Journal of Approximate Reasoning – Volume 51, Issue 8, October 2010 has a couple of items of interest:
