Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 17, 2010

IEEE Computer Society Technical Committee on Semantic Computing (TCSEM)

The IEEE Computer Society Technical Committee on Semantic Computing (TCSEM)

addresses the derivation and matching of the semantics of computational content to that of naturally expressed user intentions in order to retrieve, manage, manipulate or even create content, where “content” may be anything including video, audio, text, software, hardware, network, process, etc.

The committee is being organized by Phillip C-Y Sheu (UC Irvine), psheu@uci.edu, Phone: +1 949 824 2660. Volunteers are needed for both organizational and technical committees.

This is a good way to meet people, make a positive contribution, and have a lot of fun.

The Neighborhood Auditing Tool for the UMLS and its Source Terminologies

Filed under: Authoring Topic Maps,Interface Research/Design,Mapping,Topic Maps,Usability — Patrick Durusau @ 5:19 am

The next NCBO Webinar will be presented by Dr. James Geller from the New Jersey Institute of Technology on “The Neighborhood Auditing Tool for the UMLS and its Source Terminologies” at 10:00am PDT, Wednesday, October 20.

ABSTRACT:

The UMLS’s integration of more than 100 source vocabularies makes it susceptible to errors. Furthermore, its size and complexity can make it very difficult to locate such errors. A software tool, called the Neighborhood Auditing Tool (NAT), that facilitates UMLS auditing is presented. The NAT supports “neighborhood-based” auditing, where, at any given time, an auditor concentrates on a single focus concept and one of a variety of neighborhoods of its closely related concepts. The NAT can be seen as a special browser for the complex structure of the UMLS’s hierarchies. Typical diagrammatic displays of concept networks have a number of shortcomings, so the NAT utilizes a hybrid diagram/text interface that features stylized neighborhood views which retain some of the best features of both the diagrammatic layouts and text windows while avoiding the shortcomings. The NAT allows an auditor to display knowledge from both the Metathesaurus (concept) level and the Semantic Network (semantic type) level. Various additional features of the NAT that support the auditing process are described. The usefulness of the NAT is demonstrated through a group of case studies. Its impact is tested with a study involving a select group of auditors.


WEBEX DETAILS:
Topic: NCBO Webinar Series
Date: Wednesday, October 20, 2010
Time: 10:00 am, Pacific Daylight Time (San Francisco, GMT-07:00)
Meeting Number: 929 613 752
Meeting Password: ncbomeeting

****

The above is a heavily edited version of NCBO Webinar – James Geller, October 20 at 10:00am PT, which has numerous other details.

If you translate “integration” as “merging,” the relevance to topic maps and to the exploration of data sets becomes immediately obvious.
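To make that parallel concrete, here is a minimal sketch of “integration as merging”: source terms that share an identifier collapse into a single merged record. The records and the concept identifier (standing in for a UMLS CUI) are invented for illustration.

```python
# Hypothetical sketch: "integration" of source vocabularies as merging.
# Terms from different sources that share an identifier collapse into
# one merged record, pooling their names and tracking their sources.

def merge_by_identifier(records):
    """Group records by shared identifier, unioning names and sources."""
    merged = {}
    for rec in records:
        cui = rec["id"]
        entry = merged.setdefault(cui, {"id": cui, "names": set(), "sources": set()})
        entry["names"].update(rec["names"])
        entry["sources"].add(rec["source"])
    return merged

# Two source vocabularies describing the same concept, plus a third record:
records = [
    {"id": "C0018787", "names": {"Heart"}, "source": "MSH"},
    {"id": "C0018787", "names": {"Cor"}, "source": "SNOMEDCT"},
    {"id": "C0027051", "names": {"Myocardial infarction"}, "source": "MSH"},
]
merged = merge_by_identifier(records)
```

Auditing tools like the NAT exist precisely because, at UMLS scale, this kind of collapse sometimes groups terms that do not belong together.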

Finding What You Want

Filed under: Data Mining,Music Retrieval,Semantics,Similarity — Patrick Durusau @ 5:00 am

The Known World, a column/blog by David Alan Grier, appears both online and in Computer, a publication of the IEEE Computer Society. Finding What You Want appears in the September, 2010 issue of Computer.

Grier explores how Pandora augments our abilities to explore the vastness of musical space. For years, music retrieval systems had static categories imposed upon them; those work for some purposes, but they also impose requirements on users at retrieval time.

According to Grier, the “Great Napster Crisis of 1999-2001” gave rise to a new field of music retrieval systems, because the existing areas did not quite fit.

I find Grier’s analysis interesting because of his suggestion that the methods by which we find information of interest can shape what we consider as fitting our search criteria.

Perhaps, just perhaps, identifying subjects isn’t quite the cut-and-dried string-matching exercise it is commonly taken to be. Music retrieval systems may be a fruitful place to look for clues on how to improve more traditional information systems.

Questions:

  1. Review Music Retrieval: A Tutorial and Review. (Somewhat dated, can you suggest a replacement?)
  2. Pick two or three techniques used for retrieval of music. How would you adapt those for texts?
  3. How would you test your adapted techniques against a text collection?

SGDB – Simple Graph Database Optimized for Activation Spreading Computation

Filed under: Graphs,Software,Topic Map Software — Patrick Durusau @ 3:38 am

SGDB – Simple Graph Database Optimized for Activation Spreading Computation Authors: Marek Ciglan and Kjetil Nørvåg Keywords: spreading activation, graph, link graph, graph database, persistent media

Abstract:

In this paper, we present SGDB, a graph database with a storage model optimized for computation of Spreading Activation (SA) queries. The primary goal of the system is to minimize the execution time of spreading activation algorithm over large graph structures stored on a persistent media; without pre-loading the whole graph into the memory. We propose a storage model aiming to minimize number of accesses to the storage media during execution of SA and we propose a graph query type for the activation spreading operation. Finally, we present the implementation and its performance characteristics in scope of our pilot application that uses the activation spreading over the Wikipedia link graph.

Useful if your topic map won’t fit into memory, or if you want to use spreading activation with your topic map. Not to mention that SGDB has some interesting performance characteristics versus a general graph database. Or so the article says; I haven’t verified the claims, so make your own judgment.

Software: SGDB
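For readers unfamiliar with the query type, here is a minimal in-memory sketch of spreading activation. SGDB’s contribution is running this against disk-resident graphs; the decay factor, firing threshold, and iteration count below are illustrative assumptions, not SGDB’s actual parameters.

```python
# Minimal spreading-activation sketch over an in-memory adjacency list.
# Activation flows from seed nodes to neighbors, attenuated by a decay
# factor; nodes below the firing threshold do not spread further.

def spread_activation(graph, seeds, decay=0.5, threshold=0.1, iterations=3):
    """Return activation levels after spreading from seed nodes."""
    activation = dict(seeds)  # node -> initial activation
    for _ in range(iterations):
        pulses = {}
        for node, level in activation.items():
            if level < threshold:
                continue  # below firing threshold: does not spread
            neighbors = graph.get(node, [])
            if not neighbors:
                continue
            share = level * decay / len(neighbors)
            for nb in neighbors:
                pulses[nb] = pulses.get(nb, 0.0) + share
        for node, pulse in pulses.items():
            activation[node] = activation.get(node, 0.0) + pulse
    return activation

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
result = spread_activation(graph, {"a": 1.0})
```

The storage-model question SGDB addresses is exactly the `graph.get(node, [])` step: each neighborhood lookup becomes a disk access when the graph does not fit in memory.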

October 16, 2010

Incidence of Merging?

Filed under: Merging,Topic Map Software,Topic Maps — Patrick Durusau @ 10:20 am

Is there an average incidence of merging?

I know the rhetoric well enough: discover new relationships, new subjects, cross domain or even semantic universe boundaries, and so on. But, ok, how often?

Take for example the Opera and CIA World Fact Book topic maps. When they merge, how many topics actually merge?

One expects only the geographic locations to merge, which is useful, but what percentage of the overall topics does that represent? In either map?

Questions:

  1. Is incidence of merging a useful measurement? Yes/No, Why?
  2. Is there something beyond incidence of merging that you would measure for merged topic maps?
  3. How would you evaluate the benefits of merging two (or more) topic maps?
  4. How would you plan for merging in a topic map design?

(Either of the last two questions can be expanded into design projects.)
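As a starting point for question 1, “incidence of merging” could be measured as the fraction of topics in each map that participate in a merge. A sketch, with invented identifier sets standing in for the Opera and CIA World Fact Book topic maps:

```python
# A sketch of "incidence of merging" as a measurement: the fraction of
# topics in each map that actually merged. Topic identities are reduced
# to plain identifier strings; both maps are invented for illustration.

def merge_incidence(map_a, map_b):
    """Return (shared identifiers, fraction of A merged, fraction of B merged)."""
    shared = set(map_a) & set(map_b)
    return shared, len(shared) / len(map_a), len(shared) / len(map_b)

opera_map = {"norway", "italy", "puccini", "tosca", "la-boheme"}
factbook_map = {"norway", "italy", "france", "germany"}
shared, frac_a, frac_b = merge_incidence(opera_map, factbook_map)
```

Here only the geographic topics merge: 40% of the (tiny) opera map and 50% of the (tiny) fact book map, which is the kind of per-map percentage the questions above ask about.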

Topic Maps Elevator Speech

Filed under: Marketing,Topic Maps — Patrick Durusau @ 10:10 am

Topic Maps Elevator Speech.

Steve Newcomb’s winning Topic Maps Elevator Speech at TMRA 2010, along with the certificate to “prove” he had the winning entry, timed at 40 seconds.

Yeah, I know. 40 seconds. I have heard Steve ask for permission to speak and take more than 40 seconds. 😉

Seriously, this is a good elevator speech and we need more of them.

Proceedings of the Very Large Database Endowment Inc.

Filed under: Data Mining,Searching,SQL — Patrick Durusau @ 7:11 am

Proceedings of the Very Large Database Endowment Inc.

A resource made available by the Very Large Database Endowment Inc., which also publishes The VLDB Journal.

With titles like: Scalable multi-query optimization for exploratory queries over federated scientific databases (http://www.vldb.org/pvldb/1/1453864.pdf if you are interested), the interest factor for topic mappers is obvious.

Questions:

  1. What library journals do you scan every week/month? What subject areas?
  2. What CS journals do you scan every week/month? What subject areas?
  3. Pick two different subject areas to follow for the next two months.
  4. What reading strategies did you use for the additional materials?
  5. What did you see/learn that you would have otherwise missed?

PS: Turnabout is fair play. The class can decide on two subject areas with up to 5 journals (total) that I should be following.

October 15, 2010

EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs

EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs. Authors: B. Aditya Prakash, Ashwin Sridharan, Mukund Seshadri, Sridhar Machiraju, and Christos Faloutsos Keywords: EigenSpokes – Communities – Graphs

Abstract:

We report a surprising, persistent pattern in large sparse social graphs, which we term EigenSpokes. We focus on large Mobile Call graphs, spanning about 186K nodes and millions of calls, and find that the singular vectors of these graphs exhibit a striking EigenSpokes pattern wherein, when plotted against each other, they have clear, separate lines that often neatly align along specific axes (hence the term “spokes”). Furthermore, analysis of several other real-world datasets e.g., Patent Citations, Internet, etc. reveals similar phenomena indicating this to be a more fundamental attribute of large sparse graphs that is related to their community structure.

This is the first contribution of this paper. Additional ones include (a) study of the conditions that lead to such EigenSpokes, and (b) a fast algorithm for spotting and extracting tightly-knit communities, called SpokEn, that exploits our findings about the EigenSpokes pattern.

The notion of “chipping” off communities for further study from a large graph is quite intriguing.

In part because those communities (need I say subjects?) are found as the result of a process of exploration rather than declaration.

To be sure, those subjects can be “declared” in a topic map, but finding and identifying subjects, and deciding on their subject identity properties, is a lot more fun.

T-Rex Information Extraction

Filed under: Document Classification,Entity Extraction,EU,Relation Extraction — Patrick Durusau @ 6:16 am

T-Rex (Trainable Relation Extraction).

Tools for document classification, and for entity and relation (read: association) extraction.

Topic maps of any size are going to be constructed from mining of “data” and in a lot of cases that will mean “documents” (to the extent that is a meaningful distinction).

An interesting toolkit for that purpose, but apparently no longer maintained. It is parked at Sourceforge after having been funded by the EU.

Does anyone have a status update on this project?

Using Tag Clouds to Promote Community Awareness in Research Environments

Filed under: Interface Research/Design,Tagging,Uncategorized,Visualization — Patrick Durusau @ 6:08 am

Using Tag Clouds to Promote Community Awareness in Research Environments Authors: Alexandre Spindler, Stefania Leone, Matthias Geel, Moira C. Norrie Keywords: Tag Clouds – Ambient Information – Community Awareness

Abstract:

Tag clouds have become a popular visualisation scheme for presenting an overview of the content of document collections. We describe how we have adapted tag clouds to provide visual summaries of researchers’ activities and use these to promote awareness within a research group. Each user is associated with a tag cloud that is generated automatically based on the documents that they read and write and is integrated into an ambient information system that we have implemented.

One of the selling points of topic maps has been the serendipitous discovery of new information. Discovery is predicated on awareness and this is an interesting approach to that problem.

Questions:

  1. To what extent does awareness of tagging by colleagues influence future tagging?
  2. How would you design a project to measure the influence of tagging?
  3. Would the influence of tagging change your design of an information interface? Why/Why not? If so, how?

October 14, 2010

linloglayout

Filed under: Clustering,Graphs,Subject Identity — Patrick Durusau @ 10:45 am

linloglayout

Overview:

LinLogLayout is a simple program for computing graph layouts (positions of graph nodes in two- or three-dimensional space) and graph clusterings. It reads a graph from a file, computes a layout and a clustering, writes the layout and the clustering to a file, and displays them in a dialog. LinLogLayout can be used to identify groups of densely connected nodes in graphs, like communities of friends or collaborators in social networks, related documents in hyperlink structures (e.g. web graphs), cohesive subsystems in software systems, etc. With a change of a parameter in the main method, it can also compute classical “nice” (i.e. readable) force-directed layouts.

Finding “densely connected nodes” is one step towards finding subjects.

Subject finding tool kits will include a variety of such techniques.
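LinLogLayout finds such groups with an energy model; for contrast, here is another technique such a tool kit might include, the classic greedy “peeling” heuristic for a densest subgraph (a 2-approximation): repeatedly remove the minimum-degree node and keep the densest intermediate subgraph seen.

```python
# Greedy peeling heuristic for the densest subgraph (2-approximation):
# repeatedly delete the minimum-degree node, tracking the subgraph with
# the best edges-to-nodes ratio seen along the way.

def densest_subgraph(edges):
    """Return (node set, density) of the densest subgraph found by peeling."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    remaining = {n: set(nb) for n, nb in adj.items()}
    edge_count = len(edges)
    best_nodes, best_density = set(adj), 0.0
    while remaining:
        density = edge_count / len(remaining)
        if density > best_density:
            best_density, best_nodes = density, set(remaining)
        node = min(remaining, key=lambda n: len(remaining[n]))
        edge_count -= len(remaining[node])  # edges lost with this node
        for nb in remaining[node]:
            remaining[nb].discard(node)
        del remaining[node]
    return best_nodes, best_density

# A 4-clique with a dangling path attached; peeling strips the path:
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
         ("c", "d"), ("d", "e"), ("e", "f")]
community, density = densest_subgraph(edges)
```

The dense core {a, b, c, d} survives while the low-degree tail is peeled away, which is the “densely connected nodes as subject candidates” idea in miniature.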

Using text animated transitions to support navigation in document histories

Filed under: Authoring Topic Maps,Interface Research/Design,Software,Trails,Visualization — Patrick Durusau @ 10:34 am

Using text animated transitions to support navigation in document histories Authors: Fanny Chevalier, Pierre Dragicevic, Anastasia Bezerianos, Jean-Daniel Fekete Keywords: animated transitions, revision control, text editing

Abstract:

This article examines the benefits of using text animated transitions for navigating in the revision history of textual documents. We propose an animation technique for smoothly transitioning between different text revisions, then present the Diffamation system. Diffamation supports rapid exploration of revision histories by combining text animated transitions with simple navigation and visualization tools. We finally describe a user study showing that smooth text animation allows users to track changes in the evolution of textual documents more effectively than flipping pages.

Project website: http://www.aviz.fr/diffamation/

The video of tracking changes to a document has to be seen to be appreciated.

How to visualize changes/revisions to a topic map remains a research question. This is one starting place.

FrameWire: a tool for automatically extracting interaction logic from paper prototyping tests

Filed under: Interface Research/Design — Patrick Durusau @ 10:14 am

FrameWire: a tool for automatically extracting interaction logic from paper prototyping tests Authors: Yang Li, Xiang Cao, Katherine Everitt, Morgan Dixon, James A. Landay Keywords: paper prototyping, programming by demonstration

Abstract:

Paper prototyping offers unique affordances for interface design. However, due to its spontaneous nature and the limitations of paper, it is difficult to distill and communicate a paper prototype design and its user test findings to a wide audience. To address these issues, we created FrameWire, a computer vision-based system that automatically extracts interaction flows from the video recording of paper prototype user tests. Based on the extracted logic, FrameWire offers two distinct benefits for designers: a structural view of the video recording that allows a designer or a stakeholder to easily distill and understand the design concept and user interaction behaviors, and automatic generation of interactive HTML-based prototypes that can be easily tested with a larger group of users as well as “walked through” by other stakeholders. The extraction is achieved by automatically aggregating video frame sequences into an interaction flow graph based on frame similarities and a designer-guided clustering process. The results of evaluating FrameWire with realistic paper prototyping tests show that our extraction approach is feasible and FrameWire is a promising tool for enhancing existing prototyping practice.

A clever approach that enhances a standard tool for interface design.

Obviously useful for the design of topic map interfaces.

Curious though, has anyone used paper prototyping to design topics or the merging of topics?

October 13, 2010

Semantic Drift: What Are Linked Data/RDF and TMDM Topic Maps Missing?

Filed under: Linked Data,RDF,Subject Identifiers,Subject Identity,Topic Maps — Patrick Durusau @ 9:38 am

One RDF approach to semantic drift is to situate a vocabulary among other terms.

TMDM topic maps enable a user to gather up information that they considered as identifying the subject in question.

Additional information helps to identify a particular subject. (RDF/TMDM approaches)

Isn’t that the opposite of semantic drift?

What’s happening in both cases?

The RDF approach is guessing that it has the sense of the word as used by the author (if it is the right word at all).

Kleb reports approximately 48% precision.

So in 1 out of 2 emergency room situations we get the right term? (Not to knock Kleb’s work. It is an important approach that needs further development.)

Topic maps are guessing as well.

We don’t know what information in a subject identifier identifies a subject. Some of it? All of it? Under what circumstances?

Question: What information identifies a subject, at least to its author?

Answer: Ask the Author.

Asking authors what information identifies their subject(s) seems like an overlooked approach.

Domain specific vocabularies with additional information about subjects that indicates the information that identifies a subject versus merely supplemental information would be a good start.

That avoids inline syntax difficulties and enables authors to easily and quickly associate subject identification information with their documents.

Both RDF and TMDM Topic Maps could use the same vocabularies to improve their handling of associated document content.
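A sketch of what such a vocabulary entry might look like, with the identifying information separated from the merely supplemental. All field names and values here are invented for illustration.

```python
# Hypothetical vocabulary entry separating the information that
# identifies a subject from information that merely supplements it.

VOCABULARY = {
    "john-doe": {
        "identifying": {"orcid": "0000-0000-0000-0000", "name": "John Doe"},
        "supplemental": {"affiliation": "Example University",
                         "homepage": "http://example.org/~jdoe"},
    },
}

def same_subject(entry_a, entry_b):
    """Two entries identify the same subject if their identifying info agrees."""
    return entry_a["identifying"] == entry_b["identifying"]

# Supplemental information may differ without disturbing identity:
other = {
    "identifying": {"orcid": "0000-0000-0000-0000", "name": "John Doe"},
    "supplemental": {"affiliation": "Another Institute"},
}
match = same_subject(VOCABULARY["john-doe"], other)
```

The point of the split is exactly what the comparison shows: changed affiliations or homepages never count against subject identity, because only the identifying block participates in the test.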

Graphael

Filed under: Graphs,Visualization — Patrick Durusau @ 5:21 am

Graphael

Overview:

The graphael system implements several classic force-directed layout methods, as well as several novel layout methods for non-Euclidean geometries, including hyperbolic and spherical. The system can handle large graphs, using multi-scale variations of the force-directed methods. Finally, the system can layout and visualize graphs that evolve though time, using static views, animation, and morphing.

The latest postings for this project were dated 2005.

Would be interesting to compare the capabilities here with those of later graph visualization projects.

Perhaps even a topic map of graph visualization projects to study the duplication of layout methods over time?

Exploiting knowledge-in-the-head and knowledge-in-the-social-web: effects of domain expertise on exploratory search in individual and social search environments

Exploiting knowledge-in-the-head and knowledge-in-the-social-web: effects of domain expertise on exploratory search in individual and social search environments Authors: Ruogu Kang, Wai-Tat Fu, Thomas George Kannampallil Keywords: domain expertise, exploratory search, search behavior

Abstract:

Our study compared how experts and novices performed exploratory search using a traditional search engine and a social tagging system. As expected, results showed that social tagging systems could facilitate exploratory search for both experts and novices. We, however, also found that experts were better at interpreting the social tags and generating search keywords, which made them better at finding information in both interfaces. Specifically, experts found more general information than novices by better interpretation of social tags in the tagging system; and experts also found more domain-specific information by generating more of their own keywords. We found a dynamic interaction between knowledge-in-the-head and knowledge-in-the-social-web that although information seekers are more and more reliant on information from the social Web, domain expertise is still important in guiding them to find and evaluate the information. Implications on the design of social search systems that facilitate exploratory search are also discussed.

Every librarian should have the first page of this article posted on their office door, and every library school on its local bulletin board.

Think about it. Expert searchers (read librarians) find better information than novices and can serve as guides to better information.

More research is needed on how to bridge that gap in search interfaces.

In libraries I think it is now called a “reference interview.”

(Please email, tweet, etc. this post to your librarian friends.)

Effects of popularity and quality on the usage of query suggestions during information search

Filed under: Information Retrieval,Search Interface,Searching — Patrick Durusau @ 4:44 am

Effects of popularity and quality on the usage of query suggestions during information search Authors: Diane Kelly, Amber Cushing, Maureen Dostert, Xi Niu, Karl Gyllstrom Keywords: query popularity, query quality, query recommendation, query suggestion, search behavior, social search, usage

Abstract:

Many search systems provide users with recommended queries during online information seeking. Although usage statistics are often used to recommend queries, this information is usually not displayed to the user. In this study, we investigate how the presentation of this information impacts use of query suggestions. Twenty-three subjects used an experimental search system to find documents about four topics. Eight query suggestions were provided for each topic: four were high quality queries and four were low quality queries. Fake usage information indicating how many other people used the queries was also provided. For half the queries this information was high and for the other half this information was low. Results showed that subjects could distinguish between high and low quality queries and were not influenced by the usage information. Qualitative data revealed that subjects felt favorable about the suggestions, but the usage information was less important for the search task used in this study.

Another small sample study but raises questions that successful interfaces will consider.

Where “successful” means used effectively and seen by users as effective, the latter being the more important measure.

Reactive information foraging for evolving goals

Filed under: Interface Research/Design,Navigation,Search Interface,Searching — Patrick Durusau @ 4:28 am

Reactive information foraging for evolving goals Authors: Joseph Lawrance, Margaret Burnett, Rachel Bellamy, Christopher Bogart, Calvin Swart Keywords: field study, information foraging theory, programming

Abstract:

Information foraging models have predicted the navigation paths of people browsing the web and (more recently) of programmers while debugging, but these models do not explicitly model users’ goals evolving over time. We present a new information foraging model called PFIS2 that does model information seeking with potentially evolving goals. We then evaluated variants of this model in a field study that analyzed programmers’ daily navigations over a seven-month period. Our results were that PFIS2 predicted users’ navigation remarkably well, even though the goals of navigation, and even the information landscape itself, were changing markedly during the pursuit of information.

In case you are wondering, PFIS2 stands for “Programmer Flow by Information Scent 2.”

A study of user information seeking behavior over seven (7) months following two (2) professional programmers.

Provocative work but it would be more convincing if the study sample were larger.

October 12, 2010

Semantic Drift: A Topic Map Answer (sort-of)

Filed under: Subject Identifiers,Subject Identity,TMDM,Topic Maps,XTM — Patrick Durusau @ 6:37 am

Topic maps took a different approach (than RDF) to the problem of identifying subjects, and so look at semantic drift differently.

In the original 13250, subject descriptor was defined as:

3.19 subject descriptor – Information which is intended to provide a positive, unambiguous indication of the identity of a subject, and which is the referent of an identity attribute of a topic link.

When 13250 was reformulated to focus on the XTM syntax and the legend known as the Topic Maps Data Model (TMDM), the subject descriptor of old became subject identifiers. (Clause 7, TMDM)

A subject identifier has information that identifies a subject.

The author of a topic uses information that identifies a subject to create a subject identifier. (Which is represented in a topic map by an IRI.)

Anyone can look at the subject identifier to see if they are talking about the same subject.

They are responsible for catching semantic drift if it occurs.
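In TMDM terms, sharing a subject identifier is exactly what triggers merging. A minimal sketch of that rule follows; the topic representation is invented, and a real implementation would also merge occurrences, associations, and handle transitive chains of merges.

```python
# Sketch of the TMDM merging rule: two topics merge when they share a
# subject identifier (an IRI). Greedy single pass; representation invented.

def merge_topics(topics):
    """Merge topics that share any subject identifier."""
    merged = []
    for topic in topics:
        ids = set(topic["subject_identifiers"])
        names = set(topic["names"])
        hit = None
        for existing in merged:
            if existing["subject_identifiers"] & ids:
                hit = existing
                break
        if hit:
            hit["subject_identifiers"] |= ids
            hit["names"] |= names
        else:
            merged.append({"subject_identifiers": ids, "names": names})
    return merged

topics = [
    {"subject_identifiers": ["http://example.org/puccini"], "names": ["Puccini"]},
    {"subject_identifiers": ["http://example.org/puccini"], "names": ["Giacomo Puccini"]},
    {"subject_identifiers": ["http://example.org/verdi"], "names": ["Verdi"]},
]
result = merge_topics(topics)
```

Note what the rule does not say: nothing in it inspects the information behind the IRI, which is the gap the surrounding discussion is pointing at.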

But, there is something missing from RDF and topic maps.

Something that would help with semantic drift, although they would use it differently.

Care to take a guess?

Tamara Munzner – Graphics

Filed under: Graphs,Researchers,Visualization — Patrick Durusau @ 6:24 am

Tamara Munzner is a professor at the University of British Columbia and one of the leading researchers on visualization of data.

I ran across her site looking for information on 3D visualization of graphs.

Check out her publications or software pages for a preview of items you will see here sooner or later.

A Framework for SQL-Based Mining of Large Graphs on Relational Databases

Filed under: Graphs,SQL — Patrick Durusau @ 6:23 am

A Framework for SQL-Based Mining of Large Graphs on Relational Databases Authors: Sriganesh Srihari, Shruti Chandrashekar, Srinivasan Parthasarathy Keywords: Graph mining, SQL-based approach, Relational databases

Abstract:

We design and develop an SQL-based approach for querying and mining large graphs within a relational database management system (RDBMS). We propose a simple lightweight framework to integrate graph applications with the RDBMS through a tightly-coupled network layer, thereby leveraging efficient features of modern databases. Comparisons with straight-up main memory implementations of two kernels – breadth-first search and quasi clique detection – reveal that SQL implementations offer an attractive option in terms of productivity and performance.

Something for those with SQL backends for topic maps.

Implemented using PL/SQL, so it isn’t clear how much work it would take to implement this framework on MySQL or Postgres.

If your topic map won’t fit into memory, might be worth a look.
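One of the paper’s two kernels, breadth-first search, can be expressed in portable SQL as a recursive common table expression. A sketch using Python’s built-in sqlite3; this is plain reachability over an edge table, not the paper’s tightly-coupled framework.

```python
# Graph reachability in SQL via a recursive CTE, using sqlite3.
# The UNION deduplicates rows, which also terminates the recursion.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO edges VALUES (?, ?)",
                 [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")])

rows = conn.execute("""
    WITH RECURSIVE reach(node) AS (
        SELECT 'a'                               -- start node
        UNION
        SELECT e.dst FROM edges e
        JOIN reach r ON e.src = r.node           -- follow outgoing edges
    )
    SELECT node FROM reach
""").fetchall()
nodes = {row[0] for row in rows}
```

Everything reachable from `a` comes back; the disconnected `x`/`y` component does not. MySQL (8+) and Postgres both support `WITH RECURSIVE`, which bears on the porting question above.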

Fast Discovery of Reliable k-terminal Subgraphs

Filed under: Bioinformatics,Graphs,Subgraphs — Patrick Durusau @ 6:20 am

Fast Discovery of Reliable k-terminal Subgraphs Authors: Melissa Kasari, Hannu Toivonen, Petteri Hintsanen

Abstract:

We present a novel and efficient algorithm for solving the most reliable subgraph problem with multiple query nodes on undirected random graphs. Reliable subgraphs are useful for summarizing connectivity between given query nodes. Formally, we are given a graph G = (V, E), a set of query (or terminal) nodes Q ⊆ V, and a positive integer B. The objective is to find a subgraph H ⊆ G containing Q, such that H has at most B edges, and the probability that H is connected is maximized. Previous algorithms for the problem are either computationally demanding, or restricted to only two query nodes. Our algorithm extends a previous algorithm to handle k query nodes, where 2 ≤ k ≤ |V|. We demonstrate experimentally the usefulness of reliable k-terminal subgraphs, and the accuracy, efficiency and scalability of the proposed algorithm on real graphs derived from public biological databases.

Uses biological data from the Biomine project to demonstrate a solution to the reliable subgraph problem. Nodes are judged on the basis of their network “reliability,” that is, connectedness to the query nodes.

Wonder what it would be like to have a quality measure of the nodes, in addition to “connectedness,” as the basis for the reliable subgraph calculation?

Also available at CiteSeer, Fast Discovery of Reliable k-terminal Subgraphs
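The reliability in question, the probability that the query nodes remain connected when each edge survives independently with its own probability, has no cheap closed form in general; a Monte Carlo estimate is the usual sketch. The graph and edge probabilities below are invented.

```python
# Monte Carlo estimate of subgraph reliability: sample random "worlds"
# where each edge survives with its probability, and count the worlds
# in which all query nodes end up in one connected component.
import random

def estimate_reliability(edges, query_nodes, trials=2000, seed=42):
    """edges: list of (u, v, survival_probability) triples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        adj = {}
        for u, v, p in edges:
            if rng.random() < p:  # edge survives in this world
                adj.setdefault(u, []).append(v)
                adj.setdefault(v, []).append(u)
        # Depth-first search from the first query node.
        start = query_nodes[0]
        seen, stack = {start}, [start]
        while stack:
            for nb in adj.get(stack.pop(), []):
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        hits += all(q in seen for q in query_nodes)
    return hits / trials

# Triangle: a-c directly (p=0.5) or via b (p=0.9 * 0.9 = 0.81).
edges = [("a", "b", 0.9), ("b", "c", 0.9), ("a", "c", 0.5)]
reliability = estimate_reliability(edges, ["a", "c"])
```

The exact value here is 1 − (1 − 0.5)(1 − 0.81) = 0.905, and the estimate lands close to it; the paper’s algorithm is about choosing which B edges to keep so that this probability is maximized.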

October 11, 2010

Semantic Drift: An RDF Answer (sort-of)

Filed under: RDF,Semantic Web,Subject Identity — Patrick Durusau @ 7:27 am

As promised last week, there are RDF researchers working on issues related to semantic drift.

An interesting approach can be found in: Entity Reference Resolution via Spreading Activation on RDF-Graphs. Author(s): Joachim Kleb, Andreas Abecker

Abstract:

The use of natural language identifiers as reference for ontology elements—in addition to the URIs required by the Semantic Web standards—is of utmost importance because of their predominance in the human everyday life, i.e.speech or print media. Depending on the context, different names can be chosen for one and the same element, and the same element can be referenced by different names. Here homonymy and synonymy are the main cause of ambiguity in perceiving which concrete unique ontology element ought to be referenced by a specific natural language identifier describing an entity. We propose a novel method to resolve entity references under the aspect of ambiguity which explores only formal background knowledge represented in RDF graph structures. The key idea of our domain independent approach is to build an entity network with the most likely referenced ontology elements by constructing steiner graphs based on spreading activation. In addition to exploiting complex graph structures, we devise a new ranking technique that characterises the likelihood of entities in this network, i.e. interpretation contexts. Experiments in a highly polysemic domain show the ability of the algorithm to retrieve the correct ontology elements in almost all cases.

It is the situating of a concept in a context (not assignment of a URI) that enables the correct result in a polysemic domain.

This doesn’t directly model semantic drift but does represent anchoring a term in a particular context.

The questions that divide semantic technologies are:

  • Who throws the anchor?
  • Who governs the anchors?
  • Can there be more than one anchor?
  • What about “my” anchor?
  • …and others

More on those anon.

Finding Itemset-Sharing Patterns in a Large Itemset-Associated Graph

Filed under: Data Mining,Graphs,Similarity,Subject Identity — Patrick Durusau @ 6:37 am

Finding Itemset-Sharing Patterns in a Large Itemset-Associated Graph Authors: Mutsumi Fukuzaki, Mio Seki, Hisashi Kashima, Jun Sese

Abstract:

Itemset mining and graph mining have attracted considerable attention in the field of data mining, since they have many important applications in various areas such as biology, marketing, and social network analysis. However, most existing studies focus only on either itemset mining or graph mining, and only a few studies have addressed a combination of both. In this paper, we introduce a new problem which we call itemset-sharing subgraph (ISS) set enumeration, where the task is to find sets of subgraphs with common itemsets in a large graph in which each vertex has an associated itemset. The problem has various interesting potential applications such as in side-effect analysis in drug discovery and the analysis of the influence of word-of-mouth communication in marketing in social networks. We propose an efficient algorithm ROBIN for finding ISS sets in such graph; this algorithm enumerates connected subgraphs having common itemsets and finds their combinations. Experiments using a synthetic network verify that our method can efficiently process networks with more than one million edges. Experiments using a real biological network show that our algorithm can find biologically interesting patterns. We also apply ROBIN to a citation network and find successful collaborative research works.

If you think of a set of properties, an “itemset,” as a topic, and an “itemset-sharing subgraph (ISS)” as a match/merging criterion, the relevance of this paper to topic maps becomes immediately obvious.

Useful both for discovering topics in data sets and as part of processing a topic map.
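Reading an itemset as a topic’s property set, a hypothetical match/merging criterion might be “at least k items in common.” The nodes and threshold below are invented for illustration.

```python
# Hypothetical itemset-overlap criterion for matching/merging topics:
# two nodes match when their property itemsets share at least k items.

def shares_itemset(items_a, items_b, k=2):
    """Match/merge criterion: at least k common items."""
    return len(set(items_a) & set(items_b)) >= k

node_x = {"composer", "italian", "19th-century"}
node_y = {"composer", "italian", "opera"}
node_z = {"painter", "dutch"}
```

The ISS problem in the paper is the graph-wide version of this: find connected subgraphs whose vertices all share such an itemset, rather than testing node pairs one at a time.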

Analyzing the Role of Dimension Arrangement for Data Visualization in Radviz

Filed under: High Dimensionality,Information Retrieval,Visualization — Patrick Durusau @ 6:25 am

Analyzing the Role of Dimension Arrangement for Data Visualization in Radviz Authors: Luigi Caro, Vanessa Frias-Martinez, Enrique Frias-Martinez Keywords: Radial Coordinate Visualization, Radviz, Dimension arrangement

Abstract:

The Radial Coordinate Visualization (Radviz) technique has been widely used to effectively evaluate the existence of patterns in highly dimensional data sets. A crucial aspect of this technique lies in the arrangement of the dimensions, which determines the quality of the posterior visualization. Dimension arrangement (DA) has been shown to be an NP-problem and different heuristics have been proposed to solve it using optimization techniques. However, very little work has focused on understanding the relation between the arrangement of the dimensions and the quality of the visualization. In this paper we first present two variations of the DA problem: (1) a Radviz independent approach and (2) a Radviz dependent approach. We then describe the use of the Davies-Bouldin index to automatically evaluate the quality of a visualization, i.e., its visual usefulness. Our empirical evaluation is extensive and uses both real and synthetic data sets in order to evaluate our proposed methods and to fully understand the impact that parameters such as number of samples, dimensions, or cluster separability have in the relation between the optimization algorithm and the visualization tool.

Interesting both for exploration of data sets for constructing topic maps and quite possibly for finding the “best” visualizations for a topic map deliverable.
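For the curious, the Radviz projection itself is simple to sketch: assuming the data are normalized to [0, 1], each dimension gets an anchor on the unit circle and each point becomes a weighted average of the anchors. Permuting the anchor order is exactly the dimension arrangement (DA) problem the paper studies. A minimal version, with invented example data:

```python
import math

def radviz(points, order=None):
    """Project rows of `points` (values normalized to [0, 1]) onto 2-D
    Radviz coordinates. `order` is the dimension arrangement; permuting
    it changes the layout, which is the DA problem."""
    d = len(points[0])
    order = order or list(range(d))
    anchors = [(math.cos(2 * math.pi * j / d), math.sin(2 * math.pi * j / d))
               for j in range(d)]
    projected = []
    for x in points:
        w = [x[order[j]] for j in range(d)]
        s = sum(w) or 1.0  # avoid division by zero for an all-zero row
        projected.append((sum(wj * ax for wj, (ax, _) in zip(w, anchors)) / s,
                          sum(wj * ay for wj, (_, ay) in zip(w, anchors)) / s))
    return projected

# A point with all its weight on dimension 0 lands on that dimension's anchor.
print(radviz([[1.0, 0.0, 0.0, 0.0]]))  # -> [(1.0, 0.0)]
```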

Satrap: Data and Network Heterogeneity Aware P2P Data-Mining

Filed under: Classification,Heterogeneous Data,Networks,Searching,Semantic Diversity — Patrick Durusau @ 6:15 am

Satrap: Data and Network Heterogeneity Aware P2P Data-Mining Authors: Hock Hee Ang, Vivekanand Gopalkrishnan, Anwitaman Datta, Wee Keong Ng, Steven C. H. Hoi Keywords: Distributed classification, P2P network, cascade SVM

Abstract:

Distributed classification aims to build an accurate classifier by learning from distributed data while reducing computation and communication cost. A P2P network where numerous users come together to share resources like data content, bandwidth, storage space and CPU resources is an excellent platform for distributed classification. However, two important aspects of the learning environment have often been overlooked by other works, viz., 1) location of the peers which results in variable communication cost and 2) heterogeneity of the peers’ data which can help reduce redundant communication. In this paper, we examine the properties of network and data heterogeneity and propose a simple yet efficient P2P classification approach that minimizes expensive inter-region communication while achieving good generalization performance. Experimental results demonstrate the feasibility and effectiveness of the proposed solution.

Among the other claims for Satrap:

  • achieves the best accuracy-to-communication cost ratio given that data exchange is performed to improve global accuracy.
  • allows users to control the trade-off between accuracy and communication cost with the user-specified parameters.

I find these two the most interesting.

In part because semantic integration, whether explicit or not, is always a question of cost ratio and tradeoffs.

It would be refreshing to see papers that say which semantic integrations would be too costly with method X, or impossible with method Y.
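Satrap's actual machinery (cascade SVMs over a P2P network) is more involved, but the underlying trade-off it tunes, shipping compact model summaries instead of raw data, can be sketched with a toy nearest-centroid classifier. The peers and data below are invented for illustration:

```python
# Each peer summarizes its local data as per-class centroids and shares only
# those summaries -- a toy stand-in for exchanging compact knowledge to cut
# expensive inter-region communication.

def centroids(data):
    """data: list of (features, label) -> {label: centroid}"""
    sums, counts = {}, {}
    for x, y in data:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def classify(x, models):
    """Nearest centroid over the union of all peers' shared models."""
    best, best_d = None, float("inf")
    for model in models:
        for label, c in model.items():
            d = sum((a - b) ** 2 for a, b in zip(x, c))
            if d < best_d:
                best, best_d = label, d
    return best

peer_a = [([0.0, 0.0], "neg"), ([0.2, 0.1], "neg")]
peer_b = [([1.0, 1.0], "pos"), ([0.9, 1.1], "pos")]
models = [centroids(peer_a), centroids(peer_b)]  # only summaries cross the wire
print(classify([0.1, 0.0], models), classify([1.0, 0.9], models))
```

Sending more representatives per class would raise accuracy and communication cost together, which is the knob the Satrap authors expose to users.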

October 10, 2010

An Approach for Fast Hierarchical Agglomerative Clustering Using Graphics Processors with CUDA

Filed under: Computation,CUDA,Graphic Processors,Graphs,Software — Patrick Durusau @ 11:58 am

An Approach for Fast Hierarchical Agglomerative Clustering Using Graphics Processors with CUDA Authors: S.A. Arul Shalom, Manoranjan Dash, Minh Tue Keywords: CUDA, hierarchical clustering, high performance computing, computations using graphics hardware, complete linkage

Abstract:

Graphics Processing Units in today’s desktops can well be thought of as high performance parallel processors. Each single processor within the GPU is able to execute different tasks independently but concurrently. Such computational capabilities of the GPU are being exploited in the domain of Data mining. Two types of Hierarchical clustering algorithms are realized on GPU using CUDA. Speed gains from 15 times up to about 90 times have been realized. The challenges involved in invoking Graphical hardware for such Data mining algorithms and effects of CUDA blocks are discussed. It is interesting to note that a block size of 8 is optimal for a GPU with 128 internal processors.

GPUs offer a great deal of processing power and programming them may provoke deeper insights into subject identification and mapping.

Topic mappers may be able to claim NVIDIA-based software/hardware and/or Sony PlayStation 3 units (Cell Broadband Engine) as a business expense (check with your tax advisor).

A GPU-based paper for TMRA 2011, anyone?
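For readers who have not met it, here is complete-linkage agglomerative clustering in miniature, on the CPU. The CUDA implementations in the paper parallelize exactly the expensive part below: the pairwise distance evaluations performed at every merge step. The points and cluster count are invented:

```python
def complete_linkage(points, k):
    """Repeatedly merge the two closest clusters, where cluster distance is
    the farthest pair of members (complete linkage), until k clusters remain."""
    clusters = [[p] for p in points]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def linkage(c1, c2):
        return max(dist2(a, b) for a in c1 for b in c2)

    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters

pts = [(0.0,), (0.1,), (5.0,), (5.2,)]
print(complete_linkage(pts, 2))  # -> [[(0.0,), (0.1,)], [(5.0,), (5.2,)]]
```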

DPSP: Distributed Progressive Sequential Pattern Mining on the Cloud

Filed under: Data Mining,Hadoop,MapReduce,Pattern Recognition — Patrick Durusau @ 10:12 am

DPSP: Distributed Progressive Sequential Pattern Mining on the Cloud Authors: Jen-Wei Huang, Su-Chen Lin, Ming-Syan Chen Keywords: sequential pattern mining, period of interest (POI), customer transactions

Abstract:

The progressive sequential pattern mining problem has been discussed in previous research works. With the increasing amount of data, single processors struggle to scale up. Traditional algorithms running on a single machine may have scalability troubles. Therefore, mining progressive sequential patterns intrinsically suffers from the scalability problem. In view of this, we design a distributed mining algorithm to address the scalability problem of mining progressive sequential patterns. The proposed algorithm DPSP, standing for Distributed Progressive Sequential Pattern mining algorithm, is implemented on top of the Hadoop platform, which realizes the cloud computing environment. We propose Map/Reduce jobs in DPSP to delete obsolete itemsets, update current candidate sequential patterns and report up-to-date frequent sequential patterns within each POI. The experimental results show that DPSP possesses great scalability and consequently increases the performance and the practicability of mining algorithms.

The phrase mining sequential patterns was coined in Mining Sequential Patterns, a paper authored by Rakesh Agrawal and Ramakrishnan Srikant and cited by the authors of this paper.

The original research was to find patterns in customer transactions, which I suspect are important “subjects” for discovery and representation in commerce topic maps.
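The core subproblem DPSP distributes across Map/Reduce jobs, counting the support of candidate sequential patterns over customer transactions within each POI, fits in a few lines. The transactions below are invented for illustration:

```python
def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in `sequence` in order, not necessarily
    contiguously -- the standard containment test for sequential patterns."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, sequences):
    """Fraction of sequences containing the pattern."""
    return sum(is_subsequence(pattern, s) for s in sequences) / len(sequences)

transactions = [
    ["milk", "bread", "beer"],
    ["milk", "beer"],
    ["bread", "beer", "milk"],
]
print(support(("milk", "beer"), transactions))  # 2 of 3 sequences match
```

Patterns whose support clears a user-set threshold are the “frequent sequential patterns” the abstract refers to.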

Distributed Knowledge Discovery with Non Linear Dimensionality Reduction

Filed under: Data Mining,Dimension Reduction,Heterogeneous Data,High Dimensionality — Patrick Durusau @ 9:43 am

Distributed Knowledge Discovery with Non Linear Dimensionality Reduction Authors: Panagis Magdalinos, Michalis Vazirgiannis, Dialecti Valsamou Keywords: distributed non linear dimensionality reduction, NLDR, distributed dimensionality reduction, DDR, distributed data mining, DDM, dimensionality reduction, DR, Distributed Isomap, D-Isomap, C-Isomap, L-Isomap

Abstract:

The results of data mining tasks are usually improved by reducing the dimensionality of the data. This improvement, however, is harder to achieve when the data lie on a non linear manifold and are distributed across network nodes. Although numerous algorithms for distributed dimensionality reduction have been proposed, all assume that data reside in a linear space. In order to address the non-linear case, we introduce D-Isomap, a novel distributed non linear dimensionality reduction algorithm, particularly applicable in large scale, structured peer-to-peer networks. Apart from unfolding a non linear manifold, our algorithm is capable of approximate reconstruction of the global dataset at peer level, a very attractive feature for distributed data mining problems. We extensively evaluate its performance through experiments on both artificial and real world datasets. The obtained results show the suitability and viability of our approach for knowledge discovery in distributed environments.

Topic map authors will face data mining in peer-to-peer networks sooner or later.

Not only is this a useful discussion of the issues, but the authors have also posted the source code and data sets used in the article:

http://www.db-net.aueb.gr/panagis/PAKDD2010/
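As a reminder of what is being distributed, here are the first two steps of plain Isomap in miniature: build a k-NN graph, then approximate geodesic distances with graph shortest paths (Floyd-Warshall). The final step, classical MDS on the distance matrix, is omitted here; D-Isomap's contribution is carrying these steps out with the data spread across peers. The points are invented:

```python
import math

def geodesic_distances(points, k=2):
    """Approximate geodesic distances: connect each point to its k nearest
    neighbors, then take shortest paths through the resulting graph."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    INF = float("inf")
    g = [[INF] * n for _ in range(n)]
    for i in range(n):
        g[i][i] = 0.0
        for j in sorted(range(n), key=lambda j: d[i][j])[1:k + 1]:
            g[i][j] = g[j][i] = d[i][j]
    for m in range(n):  # Floyd-Warshall all-pairs shortest paths
        for i in range(n):
            for j in range(n):
                if g[i][m] + g[m][j] < g[i][j]:
                    g[i][j] = g[i][m] + g[m][j]
    return g

# Points along a line: the geodesic between the endpoints is the sum of hops.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
print(geodesic_distances(pts, k=1)[0][3])  # -> 3.0
```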

October 9, 2010

Evolutionary Clustering and Analysis of Heterogeneous Information Networks

Filed under: Clustering,Evoluntionary,Heterogeneous Data,Networks — Patrick Durusau @ 4:48 pm

Evolutionary Clustering and Analysis of Heterogeneous Information Networks Authors: Manish Gupta; Charu Aggarwal; Jiawei Han; Yizhou Sun Keywords: ENetClus, evolutionary clustering, typed-clustering, DBLP, bibliographic networks

Abstract:

In this paper, we study the problem of evolutionary clustering of multi-typed objects in a heterogeneous bibliographic network. Traditional homogeneous clustering methods do not result in a good typed-clustering. The design of heterogeneous methods for clustering can help us better understand the evolution of each of the types apart from the evolution of the network as a whole. In fact, the problem of clustering and evolution diagnosis are closely related because of the ability of the clustering process to summarize the network and provide insights into the changes in the objects over time. We present such a tightly integrated method for clustering and evolution diagnosis of heterogeneous bibliographic information networks. We present an algorithm, ENetClus, which performs such an agglomerative evolutionary clustering which is able to show variations in the clusters over time with a temporal smoothness approach. Previous work on clustering networks is either based on homogeneous graphs with evolution, or it does not account for evolution in the process of clustering heterogeneous networks. This paper provides the first framework for evolution-sensitive clustering and diagnosis of heterogeneous information networks. The ENetClus algorithm generates consistent typed-clusterings across time, which can be used for further evolution diagnosis and insights. The framework of the algorithm is specifically designed in order to facilitate insights about the evolution process. We use this technique in order to provide novel insights about bibliographic information networks.

Exploring heterogeneous information networks is a first step towards the discovery/recognition of new subjects. Only future research can say what other novel insights will emerge from work on heterogeneous information networks.
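The temporal smoothness idea behind evolutionary clustering reduces to one line of arithmetic: score a clustering at time t by a weighted mix of how well it fits the current snapshot and how little it deviates from the clustering at t-1. The names and weights below are illustrative only, not ENetClus's actual formulation:

```python
def evolutionary_cost(snapshot_cost, history_cost, alpha=0.8):
    """Lower is better. alpha -> 1 trusts the current snapshot;
    alpha -> 0 favors stability across time steps."""
    return alpha * snapshot_cost + (1 - alpha) * history_cost

# A clustering that fits today's data well but differs sharply from
# yesterday's can still lose to a slightly worse but more stable one.
print(evolutionary_cost(1.0, 9.0, alpha=0.8))
print(evolutionary_cost(1.2, 1.0, alpha=0.8))
```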
