Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 24, 2010

Text Analysis with LingPipe 4. Draft 0.2

Filed under: Data Mining,Natural Language Processing,Text Analytics — Patrick Durusau @ 9:53 am

Text Analysis with LingPipe 4. Draft 0.2

Draft 0.2 is up to 363 pages.

Chapters:

  1. Getting Started
  2. Characters and Strings
  3. Regular Expressions
  4. Input and Output
  5. Handlers, Parsers, and Corpora
  6. Classifiers and Evaluation
  7. Naive Bayes Classifiers (not done)
  8. Tokenization
  9. Symbol Tables
  10. Sentence Boundary Detection (not done)
  11. Latent Dirichlet Allocation
  12. Singular Value Decomposition (not done)

Extensive annexes.

The book is projected to grow by another 1,000 or so pages, so the (not done) chapters will appear along with additional material in the other chapters.

Readers welcome!

Christmas came early this year!

Questions:

  1. Class presentation demonstrating use of one of the techniques on a library-related data set.
  2. Compare and contrast two of the techniques on a library-related data set. (Project)
  3. Annotated and updated bibliography for any chapter.

Update: Same questions as before but look at the updated version of the book (split into text processing and NLP as separate parts): LingPipe and Text Processing Books.

November 23, 2010

Dataists – Blog

Filed under: Data Mining — Patrick Durusau @ 7:21 pm

Dataists

A blog for data hackers.

Data mining is something that underlies every topic map of any size.

Impressive analysis of the Afghan War Diaries.

You may have heard about the diaries and seen them posted elsewhere.

Mining Interesting Subgraphs….

Filed under: Data Mining,Sampling,Subgraphs — Patrick Durusau @ 7:00 am

Mining Interesting Subgraphs by Output Space Sampling

Mohammad Al Hasan’s dissertation was the winner of the SIGKDD Ph.D. Dissertation Award.

From the dissertation:

Output space sampling is an entire paradigm shift in frequent pattern mining (FPM) that holds enormous promise. While traditional FPM strives for completeness, OSS targets to obtain a few interesting samples. The definition of interestingness can be very generic, so user can sample patterns from different target distributions by choosing different interestingness functions. This is very beneficial as mined patterns are subject to subsequent use in various knowledge discovery tasks, like classification, clustering, outlier detection, etc. and the interestingness score of a pattern varies for various tasks. OSS can adapt to this requirement just by changing the interestingness function. OSS also solves pattern redundancy problem by finding samples that are very different from each other. Note that, pattern redundancy hurts any knowledge based system that builds metrics based on the structural similarity of the patterns.

Nice to see recognition that for some data sets we don’t need (or want) full enumeration of all occurrences.

Something topic map advocates need to remember when proselytizing for topic maps.

The goal is not all the information known about a subject.

The goal is all the information a user wants about a subject.

Not the same thing.
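To make that concrete, here is a minimal sketch of output space sampling as a Metropolis-Hastings random walk over the itemset lattice, using support as a stand-in for a generic interestingness function. The toy transactions and the walk itself are my illustration, not code from the dissertation.

```python
import random

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
items = sorted(set().union(*transactions))

def support(pattern):
    # Interestingness stand-in: fraction of transactions containing the pattern.
    return sum(pattern <= t for t in transactions) / len(transactions)

def neighbors(pattern):
    # Add or remove a single item, never leaving the non-empty lattice.
    result = []
    for i in items:
        flipped = pattern ^ {i}
        if flipped:
            result.append(frozenset(flipped))
    return result

def sample(steps=20000, seed=1):
    random.seed(seed)
    current = frozenset([random.choice(items)])
    visits = {}
    for _ in range(steps):
        proposal = random.choice(neighbors(current))
        # Metropolis-Hastings acceptance, corrected for neighborhood sizes,
        # so the walk samples patterns proportional to their interestingness.
        ratio = (support(proposal) * len(neighbors(current))) / (
                 support(current) * len(neighbors(proposal)))
        if random.random() < min(1.0, ratio):
            current = proposal
        visits[current] = visits.get(current, 0) + 1
    return sorted(visits.items(), key=lambda kv: -kv[1])[:5]

for pattern, count in sample():
    print(sorted(pattern), count)
```

Swap in a different interestingness function and the same walk samples from a different target distribution; no enumeration required.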

Questions:

  1. What criteria of “interestingness” would you apply in gathering data for easy access by your patrons? (3-5 pages, no citations)
  2. How would you use this technique for authoring and/or testing a topic map? (3-5 pages, no citations. Think of “testing” a topic map as checking how well it represents a particular data collection.)
  3. Bibliography of material citing the paper or applying this technique.

November 22, 2010

Minimum Description Length (MDL)

Filed under: Data Mining,Minimum Description Length,Pattern Compression — Patrick Durusau @ 8:36 am

mdl-research.org

From the website:

The purpose of statistical modeling is to discover regularities in observed data. The success in finding such regularities can be measured by the length with which the data can be described. This is the rationale behind the Minimum Description Length (MDL) Principle introduced by Jorma Rissanen (Rissanen, 1978).

“The MDL Principle is a relatively recent method for inductive inference. The fundamental idea behind the MDL Principle is that any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally.” (Grünwald, 1998)

The website offers a reading list on MDL, demonstrations (with links to software), a list of researchers, related topics and upcoming conferences.
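For a feel of the principle, here is a toy two-part-code example (my own, not from the site): pick the number of segments for a 0/1 sequence by minimizing parameter bits plus data bits.

```python
import math

def bernoulli_bits(seq, p):
    # Codelength of a 0/1 sequence under a Bernoulli(p) code.
    p = min(max(p, 1e-6), 1 - 1e-6)
    return sum(-math.log2(p if x else 1 - p) for x in seq)

def mdl_score(seq, k):
    # Two-part code: about (1/2) log2(n) bits per parameter, plus data bits,
    # with the sequence split into k equal segments, each with its own rate.
    n = len(seq)
    bits = k * 0.5 * math.log2(n)
    size = n // k
    for s in range(k):
        seg = seq[s * size:] if s == k - 1 else seq[s * size:(s + 1) * size]
        bits += bernoulli_bits(seg, sum(seg) / len(seg))
    return bits

# A sequence with one regime change: too few segments waste data bits,
# too many waste parameter bits. MDL favors k = 2.
seq = [0] * 40 + [1] * 40
for k in (1, 2, 4, 8):
    print(k, round(mdl_score(seq, k), 2))
```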

Pattern Compression – 7 Magnitudes of Reduction

Filed under: Data Mining,Minimum Description Length,Pattern Compression — Patrick Durusau @ 8:14 am

Making Pattern Mining Useful.

Jilles Vreeken’s dissertation was a runner-up for the 2010 ACM SIGKDD Dissertation Award.

Vreeken proposes “compression” of data patterns on the basis of Minimum Description Length (MDL) (see The Minimum Description Length Principle) and KRIMP, “a heuristic parameter-free algorithm for finding the optimal set of frequent itemsets.” (SIGKDD, vol. 12, issue 1, page 76)

Readers should take note that experiments indicate KRIMP achieves seven orders of magnitude of reduction in the number of patterns. Let me say that again: seven orders of magnitude of reduction. In practice, not in theory.

Vreeken’s homepage has other materials of interest on this topic.
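To see the mechanism behind those numbers, here is a stripped-down sketch of the KRIMP idea: cover each transaction greedily with code-table itemsets and charge bits by usage, so frequently used itemsets get short codes. The code table below is hand-picked for illustration; KRIMP searches for the table that minimizes the total size.

```python
import math
from collections import Counter

transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c", "d"}, {"c", "d"}]
code_table = [frozenset({"a", "b", "c"}), frozenset({"c", "d"}),
              frozenset({"a"}), frozenset({"b"}), frozenset({"c"}), frozenset({"d"})]

def cover(transaction):
    # Greedily cover a transaction with code-table itemsets, largest first.
    remaining, used = set(transaction), []
    for itemset in sorted(code_table, key=len, reverse=True):
        if itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
    return used

usage = Counter(itemset for t in transactions for itemset in cover(t))
total = sum(usage.values())
# Shannon-optimal code: an itemset used often gets a short code.
db_bits = sum(count * -math.log2(count / total) for count in usage.values())
print(f"encoded database size: {db_bits:.1f} bits")
print({tuple(sorted(k)): v for k, v in usage.items()})
```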

Questions:

  1. Application of “minimum description length” in library science? (report for class)
  2. How would you apply “minimum description length” techniques in library science? (3-5 pages, citations)
  3. Introduction to “Minimum Description Length” for librarians (class presentation, examples relevant to librarians)

A Term Association Inference Model for Single Documents:….

Filed under: Data Mining,Document Classification,Information Retrieval,Summarization — Patrick Durusau @ 6:36 am

A Term Association Inference Model for Single Documents: A Stepping Stone for Investigation through Information Extraction
Author(s): Sukanya Manna and Tom Gedeon
Keywords: Information retrieval, investigation, Gain of Words, Gain of Sentences, term significance, summarization

Abstract:

In this paper, we propose a term association model which extracts significant terms as well as the important regions from a single document. This model is a basis for a systematic form of subjective data analysis which captures the notion of relatedness of different discourse structures considered in the document, without having a predefined knowledge-base. This is a paving stone for investigation or security purposes, where possible patterns need to be figured out from a witness statement or a few witness statements. This is unlikely to be possible in predictive data mining where the system can not work efficiently in the absence of existing patterns or large amount of data. This model overcomes the basic drawback of existing language models for choosing significant terms in single documents. We used a text summarization method to validate a part of this work and compare our term significance with a modified version of Salton’s [1].

Excellent work that illustrates how rethinking the fundamental assumptions of data mining can lead to useful results.
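Lacking the authors’ Gain of Words / Gain of Sentences measures, here is a generic single-document sketch in the same spirit: score sentences by the in-document frequency of their terms, with no external knowledge base. A stand-in for their model, not an implementation of it.

```python
import re
from collections import Counter

def summarize(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tf = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # Average in-document frequency of the sentence's terms.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(tf[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Keep the original order for readability.
    return [s for s in sentences if s in ranked]

doc = ("The witness saw a blue van. The van stopped near the bank. "
       "A man left the van and entered the bank. Later the van drove away.")
print(summarize(doc))
```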

Questions:

  1. Create an annotated bibliography of citations to this article.
  2. Citations of items in the bibliography since this paper (2008)? List and annotate.
  3. How would you use this approach with a document archive project? (3-5 pages, no citations)

November 20, 2010

Associations: The Kind They Pay For

Filed under: Associations,Authoring Topic Maps,Data Mining,Data Structures — Patrick Durusau @ 4:56 pm

Fun at a Department Store: Data Mining Meets Switching Theory
Author(s): Anna Bernasconi, Valentina Ciriani, Fabrizio Luccio, Linda Pagli
Keywords: SOP, Implicants, Data Mining, Frequent Itemsets, Blulife

Abstract:

In this paper we introduce new algebraic forms, SOP+ and DSOP+, to represent functions f: {0,1}^n → ℕ, based on arithmetic sums of products. These expressions are a direct generalization of the classical SOP and DSOP forms.

We propose optimal and heuristic algorithms for minimal SOP+ and DSOP+ synthesis. We then show how the DSOP+ form can be exploited for Data Mining applications. In particular we propose a new compact representation for the database of transactions to be used by the LCM algorithms for mining frequent closed itemsets.

A new technique for extracting associations between items present (or absent) in transactions (sales transactions).

Of interest to people with the funds to pay for data mining and topic maps.

Topic maps are useful to bind the mining of such associations to other information systems, such as supply chains.
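The SOP+ synthesis itself is beyond a blog sketch, but the object being represented is simple: a transaction database as a function f: {0,1}^n → ℕ from item combinations to multiplicities. A compacted (pattern, count) table, with toy data, looks like this:

```python
from collections import Counter

items = ["bread", "milk", "eggs"]
transactions = [("bread", "milk"), ("bread", "milk"), ("eggs",),
                ("bread", "milk", "eggs")]

# f maps each distinct item combination to its multiplicity.
f = Counter(frozenset(t) for t in transactions)
for pattern, count in f.items():
    bits = "".join("1" if i in pattern else "0" for i in items)
    print(bits, sorted(pattern), "x", count)

def support(itemset):
    # Support of an itemset is a weighted sum over the compact table.
    s = frozenset(itemset)
    return sum(count for pattern, count in f.items() if s <= pattern)

print("support(bread, milk) =", support({"bread", "milk"}))
```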

Questions:

  1. How would you use data mining of transaction associations to guide collection development? (3-5 pages, with citations)
  2. How would you use topic maps with the mining of transaction associations? (3-5 pages, no citations)
  3. How would you bind an absence of data to other information? (3-5 pages, no citations)

Observation: Intelligence agencies recognize the absence of data as an association. Binding that absence to other data is a job for topic maps.

November 12, 2010

As Time Goes by: Discovering Eras in Evolving Social Networks

Filed under: Clustering,Data Mining,Evolutionary — Patrick Durusau @ 6:21 pm

As Time Goes by: Discovering Eras in Evolving Social Networks
Author(s): Michele Berlingerio, Michele Coscia, Fosca Giannotti, Anna Monreale, Dino Pedreschi

Abstract:

Within the large body of research in complex network analysis, an important topic is the temporal evolution of networks. Existing approaches aim at analyzing the evolution on the global and the local scale, extracting properties of either the entire network or local patterns. In this paper, we focus instead on detecting clusters of temporal snapshots of a network, to be interpreted as eras of evolution. To this aim, we introduce a novel hierarchical clustering methodology, based on a dissimilarity measure (derived from the Jaccard coefficient) between two temporal snapshots of the network. We devise a framework to discover and browse the eras, either in top-down or a bottom-up fashion, supporting the exploration of the evolution at any level of temporal resolution. We show how our approach applies to real networks, by detecting eras in an evolving co-authorship graph extracted from a bibliographic dataset; we illustrate how the discovered temporal clustering highlights the crucial moments when the network had profound changes in its structure. Our approach is finally boosted by introducing a meaningful labeling of the obtained clusters, such as the characterizing topics of each discovered era, thus adding a semantic dimension to our analysis.

Deeply interesting work.
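The core of the method is easy to sketch: a Jaccard-based dissimilarity between consecutive temporal snapshots, with an era boundary wherever the structure changes abruptly. The threshold below is my simplification; the paper uses a full hierarchical clustering.

```python
def jaccard_dissimilarity(edges_a, edges_b):
    union = edges_a | edges_b
    return 1 - len(edges_a & edges_b) / len(union) if union else 0.0

snapshots = [  # one edge set per time step (e.g., co-authorship per year)
    {("ann", "bob"), ("bob", "carl")},
    {("ann", "bob"), ("bob", "carl"), ("carl", "dee")},
    {("eve", "fay"), ("fay", "gus")},
    {("eve", "fay"), ("fay", "gus"), ("gus", "hal")},
]

eras, current = [], [0]
for t in range(1, len(snapshots)):
    if jaccard_dissimilarity(snapshots[t - 1], snapshots[t]) > 0.8:
        eras.append(current)   # abrupt structural change: close the era
        current = []
    current.append(t)
eras.append(current)
print(eras)   # expect a break between snapshots 1 and 2
```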

Questions:

  1. Is it a fair assumption that terms used by one scholar will be used the same way by scholars who cite them? (discussion)
  2. If you think #1 is true, then does entity resolution, etc., however you want to talk about recognition of subjects, apply from the first scholar outwards? If so, how far? (discussion)
  3. If you think #1 is false, why? (discussion)
  4. How would you go about designing a project to identify usages of terms in a body of literature? Such that you could detect changes in usage? What questions would you have to ask? (3-5 pages, citations)

PS: Another way to think about this area is: Do terms have social lives? Is that a useful way to talk about them?

November 9, 2010

Summarizing Multidimensional Data Streams: A Hierarchy-Graph-Based Approach

Filed under: Authoring Topic Maps,Data Mining — Patrick Durusau @ 7:44 pm

Summarizing Multidimensional Data Streams: A Hierarchy-Graph-Based Approach
Author(s): Yoann Pitarch, Anne Laurent, Pascal Poncelet

Abstract:

When dealing with potentially infinite data streams, storing the whole data stream history is unfeasible and providing a high-quality summary is required. In this paper, we propose a summarization method for multidimensional data streams based on a graph structure and taking advantage of the data hierarchies. The summarization method considers the data distribution and thus overcomes a major drawback of the Tilted Time Window common framework. We adapt this structure for synthesizing frequent itemsets extracted on temporal windows. Thanks to our approach, as users do not analyze any more numerous extraction results, the result processing is improved.

As a text scholar, I would presume that all occurrences are stored.

For high-speed data streams too large to store, which are read in one pass, that isn’t an option.

If terabytes of high speed data are on your topic mapping horizon, start here.
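For orientation, the Tilted Time Window framework the authors improve on is easy to sketch: recent data kept at fine granularity, older data merged into coarser windows. The window capacities here are an illustrative choice.

```python
class TiltedTimeWindow:
    def __init__(self, capacities=(4, 4, 4)):  # e.g. hours -> days -> weeks
        self.levels = [[] for _ in capacities]
        self.capacities = capacities

    def add(self, count):
        self.levels[0].insert(0, count)
        for lvl, cap in enumerate(self.capacities[:-1]):
            if len(self.levels[lvl]) > cap:
                # Merge the two oldest fine windows into one coarser window.
                merged = self.levels[lvl].pop() + self.levels[lvl].pop()
                self.levels[lvl + 1].insert(0, merged)

ttw = TiltedTimeWindow()
for hour_count in range(1, 13):
    ttw.add(hour_count)
print(ttw.levels)  # fine-grained counts first, coarser aggregates behind
```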

****
PS: Posts on temporal modeling with proxies to follow (but not real soon).

Rule Synthesizing from Multiple Related Databases

Filed under: Clustering,Data Mining,Heterogeneous Data,Uncategorized — Patrick Durusau @ 7:33 pm

Rule Synthesizing from Multiple Related Databases
Author(s): Dan He, Xindong Wu, Xingquan Zhu
Keywords: Association rule mining, rule synthesizing, multiple databases, clustering

Abstract:

In this paper, we study the problem of rule synthesizing from multiple related databases where items representing the databases may be different, and the databases may not be relevant, or similar to each other. We argue that, for such multi-related databases, simple rule synthesizing without a detailed understanding of the databases is not able to reveal meaningful patterns inside the data collections. Consequently, we propose a two-step clustering on the databases at both item and rule levels such that the databases in the final clusters contain both similar items and similar rules. A weighted rule synthesizing method is then applied on each such cluster to generate final rules. Experimental results demonstrate that the new rule synthesizing method is able to discover important rules which can not be synthesized by other methods.

The authors observe:

…existing rule synthesizing methods for distributed mining commonly assumes that related databases are relevant, share similar data distributions, and have identical items. This is equivalent to the assumption that all stores have the same type of business with identical meta-data structures, which is hardly the case in practice.

I should start collecting quotes that recognize semantic diversity as the rule rather than the exception.

More on that later. Enjoy the article.
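The synthesis step itself reduces to a weighted combination of per-database statistics. A sketch, weighting by database size (the paper derives weights from its two-step clustering instead):

```python
def synthesize(rule_stats):
    """rule_stats: list of (db_size, support, confidence) for one rule."""
    total = sum(size for size, _, _ in rule_stats)
    support = sum(size * s for size, s, _ in rule_stats) / total
    confidence = sum(size * c for size, _, c in rule_stats) / total
    return support, confidence

# "diapers -> beer" as reported by three store databases of different sizes
stats = [(10000, 0.02, 0.61), (2500, 0.05, 0.70), (500, 0.01, 0.30)]
print(synthesize(stats))
```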

November 8, 2010

Combining the Missing Link: An Incremental Topic Model of Document Content and Hyperlink

Filed under: Classification,Data Mining,Link-IPLSI — Patrick Durusau @ 7:59 am

Combining the Missing Link: An Incremental Topic Model of Document Content and Hyperlink
Authors: Huifang Ma, Zhixin Li and Zhongzhi Shi
Keywords: Topic model, Link-IPLSI, Incremental Learning, Adaptive Asymmetric learning

Abstract:

The content and structure of linked information such as sets of web pages or research paper archives are dynamic and keep on changing. Even though different methods are proposed to exploit both the link structure and the content information, no existing approach can effectively deal with this evolution. We propose a novel joint model, called Link-IPLSI, to combine texts and links in a topic modeling framework incrementally. The model takes advantage of a novel link updating technique that can cope with dynamic changes of online document streams in a faster and scalable way. Furthermore, an adaptive asymmetric learning method is adopted to freely control the assignment of weights to terms and citations. Experimental results on two different sources of online information demonstrate the time saving strength of our method and indicate that our model leads to systematic improvements in the quality of classification.

Questions:

  1. Timed expiration of documents and terms? Appropriate for library settings? (discussion)
  2. Citations treated same as hyperlinks? (Aren’t citations more granular?) (3-5 pages, citations)
  3. What do we lose by citation to documents and not concepts/locations in documents? (3-5 pages, citations)

PS: The updating aspects of this paper are very important. Static data exists but isn’t very common in enterprise applications.

November 7, 2010

Parallel Implementation of Classification Algorithms Based on MapReduce

Filed under: Classification,Data Mining,Hadoop,MapReduce — Patrick Durusau @ 8:31 pm

Parallel Implementation of Classification Algorithms Based on MapReduce
Authors: Qing He, Fuzhen Zhuang, Jincheng Li and Zhongzhi Shi
Keywords: Data Mining, Classification, Parallel Implementation, Large Dataset, MapReduce

Abstract:

Data mining has attracted extensive research for several decades. As an important task of data mining, classification plays an important role in information retrieval, web searching, CRM, etc. Most of the present classification techniques are serial, which become impractical for large dataset. The computing resource is under-utilized and the executing time is not waitable. Provided the program mode of MapReduce, we propose the parallel implementation methods of several classification algorithms, such as k-nearest neighbors, naive bayesian model and decision tree, etc. Preparatory experiments show that the proposed parallel methods can not only process large dataset, but also can be extended to execute on a cluster, which can significantly improve the efficiency.

From the paper:

In this paper, we introduced the parallel implementation of several classification algorithms based on MapReduce, which make them be applicable to mine large dataset. The key is to design the proper key/value pairs. (emphasis in original)
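A single-process sketch of that key/value design, for naive Bayes training: map emits ((class, feature), 1) pairs, reduce sums them. On Hadoop the same logic would live in Mapper and Reducer classes; the data here is made up.

```python
from collections import defaultdict

training = [("spam", ["cheap", "pills"]), ("spam", ["cheap", "watches"]),
            ("ham", ["meeting", "agenda"])]

def map_phase(records):
    for label, features in records:
        yield (label, None), 1      # class prior count
        for f in features:
            yield (label, f), 1     # class-conditional feature count

def reduce_phase(pairs):
    # Stands in for shuffle + reduce: sum values per key.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts

print(dict(reduce_phase(map_phase(training))))
```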

Questions:

  1. Annotated bibliography of parallel classification algorithms (newer than this paper, 3-5 pages, citations)
  2. Report for class on application of parallel classification algorithms (report + paper)
  3. Application of parallel classification algorithm to a library dataset (project)
  4. Can the key/value pairs be interchanged with others? Yes/no, why? (3-5 pages, no citations.)

November 6, 2010

The University of Amsterdam’s Concept Detection System at ImageCLEF 2009

Filed under: Concept Detection,Data Mining,Multimedia — Patrick Durusau @ 6:40 am

The University of Amsterdam’s Concept Detection System at ImageCLEF 2009
Authors: Koen E. A. van de Sande, Theo Gevers and Arnold W. M. Smeulders
Keywords: Color, Invariance, Concept Detection, Object and Scene Recognition, Bag-of-Words, Photo Annotation, Spatial Pyramid

Abstract:

Our group within the University of Amsterdam participated in the large-scale visual concept detection task of ImageCLEF 2009. Our experiments focus on increasing the robustness of the individual concept detectors based on the bag-of-words approach, and less on the hierarchical nature of the concept set used. To increase the robustness of individual concept detectors, our experiments emphasize in particular the role of visual sampling, the value of color invariant features, the influence of codebook construction, and the effectiveness of kernel-based learning parameters. The participation in ImageCLEF 2009 has been successful, resulting in the top ranking for the large-scale visual concept detection task in terms of both EER and AUC. For 40 out of 53 individual concepts, we obtain the best performance of all submissions to this task. For the hierarchical evaluation, which considers the whole hierarchy of concepts instead of single detectors, using the concept likelihoods estimated by our detectors directly works better than scaling these likelihoods based on the class priors.

Good example of the content to expect from ImageCLEF papers.

This is a very important area of rapidly developing research.
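For readers new to the approach, here is a toy sketch of the bag-of-words step such detectors build on: assign each local descriptor to its nearest codebook word and histogram the result. The descriptors and codebook below are random stand-ins for real SIFT-style features.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 16))       # 8 visual words, 16-dim features
descriptors = rng.normal(size=(100, 16))  # local descriptors from one image

# Nearest codebook word per descriptor.
distances = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
words = distances.argmin(axis=1)

histogram = np.bincount(words, minlength=len(codebook)).astype(float)
histogram /= histogram.sum()              # the image's bag-of-words vector
print(histogram)
```

The histogram then feeds a kernel classifier (one per concept), which is where the paper’s kernel-parameter experiments come in.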

ImageCLEF – The CLEF Cross Language Image Retrieval Track

Filed under: Concept Detection,Data Mining,Multimedia — Patrick Durusau @ 6:16 am

ImageCLEF – The CLEF Cross Language Image Retrieval Track.

The European side of working with digital video.

From the 2009 event website:

ImageCLEF is the cross-language image retrieval track run as part of the Cross Language Evaluation Forum (CLEF) campaign. This track evaluates retrieval of images described by text captions based on queries in a different language; both text and image matching techniques are potentially exploitable.

TREC Video Retrieval Evaluation

Filed under: Concept Detection,Data Mining,Multimedia — Patrick Durusau @ 5:59 am

TREC Video Retrieval Evaluation.

Since I have posted several resources on digital video and concept discovery today, listing the TREC track on the same seemed appropriate.

From the website:

The TREC conference series is sponsored by the National Institute of Standards and Technology (NIST) with additional support from other U.S. government agencies. The goal of the conference series is to encourage research in information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. In 2001 and 2002 the TREC series sponsored a video “track” devoted to research in automatic segmentation, indexing, and content-based retrieval of digital video. Beginning in 2003, this track became an independent evaluation (TRECVID) with a workshop taking place just before TREC.

You will find publications, tools, bibliographies, data sets, etc. A first-class resource site.

Internet Multimedia Search and Mining

Filed under: Concept Detection,Data Mining,Domain Change,Multimedia — Patrick Durusau @ 5:51 am

Internet Multimedia Search and Mining
Authors: Xian-Sheng Hua, Marcel Worring, and Tat-Seng Chua

Abstract:

In this chapter, we address the visual learning of automatic concept detectors from web video as available from services like YouTube. While allowing a much more efficient, flexible, and scalable concept learning compared to expert labels, web-based detectors perform poorly when applied to different domains (such as specific TV channels). We address this domain change problem using a novel approach, which – after an initial training on web content – performs a highly efficient online adaptation on the target domain.

In quantitative experiments on data from YouTube and from the TRECVID campaign, we first validate that domain change appears to be the key problem for web-based concept learning, with much more significant impact than other phenomena like label noise. Second, the proposed adaptation is shown to improve the accuracy of web-based detectors significantly, even over SVMs trained on the target domain. Finally, we extend our approach with active learning such that adaptation can be interleaved with manual annotation for an efficient exploration of novel domains.

The authors cite authority for the proposition that by 2013, 91% of all Internet traffic will be digital video.

Perhaps, perhaps not, but in any event, “concept detection” is an important aid to topic map authors working with digital video.

Questions:

  1. Later research on “concept detection” in digital video? (annotated bibliography)
  2. Use in library contexts? (3-5 pages, citations)
  3. How would you design human augmentation of automated detection? (project)

November 4, 2010

A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise

Filed under: Authoring Topic Maps,Clustering,Data Mining — Patrick Durusau @ 11:26 am

A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise (1996)
Authors: Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu
Keywords: Clustering Algorithms, Arbitrary Shape of Clusters, Efficiency on Large Spatial Databases, Handling Noise.

Before you decide to skip this paper as “old,” consider that it has more than 600 citations in CiteSeer.

Abstract:

Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases rises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.

Discovery of classes is always an issue in topic map authoring/design and clustering is one way to find classes, perhaps even ones you did not suspect existed.
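DBSCAN is also trivially easy to try: it needs only a neighborhood radius (eps) and a density threshold (min_samples). A quick sketch with scikit-learn on toy 2-D points; the parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[0, 0], [0.1, 0.2], [0.2, 0.1],   # dense blob A
                   [5, 5], [5.1, 5.2], [5.2, 5.0],   # dense blob B
                   [9, 0]])                          # isolated point
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)
print(labels)   # e.g. [0 0 0 1 1 1 -1]; -1 marks noise
```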

November 3, 2010

Aperture: a Java framework for getting data and metadata

Filed under: Data Mining,Software — Patrick Durusau @ 7:07 pm

Aperture: a Java framework for getting data and metadata

From the website:

Aperture is an open source library for crawling and indexing information sources such as file systems, websites and mail boxes. Aperture supports a number of common source types and document formats out-of-the-box and provides easy ways to extend it with custom implementations.

Aperture wiki

Example applications include:

  • bibsonomycrawler.bat – crawls Bibsonomy accounts, extracts bookmarks and tags
  • deliciouscrawler.bat – crawls delicious accounts, extracts bookmarks and tags
  • filecrawler.bat – crawls filesystems, extracts the folder structure, the file metadata and the file content
  • flickrcrawler.bat – crawls flickr accounts, extracts tags, and photos metadata
  • icalcrawler.bat – crawls calendars stored in the well-known iCalendar format, extracts events, todos, journal entries, etc.
  • imapcrawler.bat – crawls remote mailboxes accessible with IMAP
  • mboxcrawler.bat – crawls local mailboxes stored in mbox-format files (e.g. those from thunderbird)
  • outlookcrawler.bat – makes a connection with the outlook instance and crawls appointments, contacts and emails; note that this crawler will obviously only work on Windows with MS Outlook installed
  • thunderbirdcrawler.bat – crawls a thunderbird addressbook, extracts contacts; note that for crawling emails, use the mboxcrawler
  • webcrawler.bat – crawls websites

More tools for your topic map toolbox!

November 1, 2010

Introduction to Graphical Models for Data Mining

Filed under: Data Mining,Graphs,Machine Learning — Patrick Durusau @ 4:32 pm

Introduction to Graphical Models for Data Mining by Arindam Banerjee, Department of Computer Science and Engineering, University of Minnesota.

Abstract:

Graphical models for large scale data mining constitute an exciting development in statistical data analysis which has gained significant momentum in the past decade. Unlike traditional statistical models which often make `i.i.d.’ assumptions, graphical models acknowledge dependencies among variables of interest and investigate inference/prediction while taking into account such dependencies. In recent years, latent variable Bayesian networks, such as latent Dirichlet allocation, stochastic block models, Bayesian co-clustering, and probabilistic matrix factorization techniques have achieved unprecedented success in a variety of application domains including topic modeling and text mining, recommendation systems, multi-relational data analysis, etc. The tutorial will give a broad overview of graphical models, and discuss recent developments in the context of mixed-membership models, matrix analysis models, and their generalizations. The tutorial will present a balanced mix of models, inference/learning methods, and applications.

Slides (pdf)
Slides (ppt)

If you plan on using data mining as a source for authoring topic maps, graphical models are on your reading list.
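Before the questions, a quick sketch of one model from the tutorial, latent Dirichlet allocation, via scikit-learn; the corpus and parameters are toy choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["library catalog subject heading catalog",
        "subject heading authority record",
        "graph network node edge network",
        "network edge clustering graph"]
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Document-topic mixtures: expect docs 1-2 and 3-4 to load on different topics.
print(lda.transform(counts).round(2))
```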

Questions:

  1. Would you use the results of a Bayesian network to author an entry in a topic map? Why/why not? (2-3 pages, no citations)
  2. Would you use the results of a Bayesian network to author an entry in a library catalog? Why/why not? (2-3 pages, no citations)
  3. Do we attribute certainty to library catalog entries that are actually possible entries for a particular item? (discussion question)
  4. Examples of the use of Bayesian networks in classification for library catalogs?

October 21, 2010

mloss.org – machine learning open source software

mloss.org – machine learning open source software

Open source repository of machine learning software.

Not only are subjects being recognized by these software packages, but their processes and choices are subjects as well. Not to mention their descriptions in the literature.

Fruitful grounds for adaptation to topic maps as well as being the subject of topic maps.

There are literally hundreds of software packages here so I welcome suggestions, comments, etc. on any and all of them.

Questions:

  1. Examples of vocabulary mis-match in machine learning literature?
  2. Using one sample data set, how would you integrate results from different packages? Assume you are not merging classifiers.
  3. What if the classifiers are unknown? That is all you have are the final results. Is your result different? Reliable?
  4. Describe a (singular) merging of classifiers in subject identity terms.

October 17, 2010

Finding What You Want

Filed under: Data Mining,Music Retrieval,Semantics,Similarity — Patrick Durusau @ 5:00 am

The Known World, a column/blog by David Alan Grier, appears both online and in Computer, a publication of the IEEE Computer Society. Finding What You Want appears in the September, 2010 issue of Computer.

Grier explores how Pandora augments our abilities to explore the vastness of musical space. For years, music retrieval systems had static categories imposed upon them, and those work for some purposes. But they also impose retrieval requirements on users.

According to Grier, the “Great Napster Crisis of 1999-2001” resulted in a new field of music retrieval systems because existing areas did not quite fit.

I find Grier’s analysis interesting because of his suggestion that the methods by which we find information of interest can shape what we consider as fitting our search criteria.

Perhaps, just perhaps, identifying subjects isn’t quite the cut-and-dried string-matching exercise it is commonly taken to be. Music retrieval systems may be a fruitful area to look for clues on how to improve more traditional information systems.

Questions:

  1. Review Music Retrieval: A Tutorial and Review. (Somewhat dated, can you suggest a replacement?)
  2. Pick two or three techniques used for retrieval of music. How would you adapt those for texts?
  3. How would you test your adapted techniques against a text collection?

October 16, 2010

Proceedings of the Very Large Database Endowment Inc.

Filed under: Data Mining,Searching,SQL — Patrick Durusau @ 7:11 am

Proceedings of the Very Large Database Endowment Inc.

A resource made available by the Very Large Database Endowment Inc., which also publishes The VLDB Journal.

With titles like: Scalable multi-query optimization for exploratory queries over federated scientific databases (http://www.vldb.org/pvldb/1/1453864.pdf if you are interested), the interest factor for topic mappers is obvious.

Questions:

  1. What library journals do you scan every week/month? What subject areas?
  2. What CS journals do you scan every week/month? What subject areas?
  3. Pick two different subject areas to follow for the next two months.
  4. What reading strategies did you use for the additional materials?
  5. What did you see/learn that you would have otherwise missed?

PS: Turnabout is fair play. The class can decide on two subject areas with up to 5 journals (total) that I should be following.

October 11, 2010

Finding Itemset-Sharing Patterns in a Large Itemset-Associated Graph

Filed under: Data Mining,Graphs,Similarity,Subject Identity — Patrick Durusau @ 6:37 am

Finding Itemset-Sharing Patterns in a Large Itemset-Associated Graph
Authors: Mutsumi Fukuzaki, Mio Seki, Hisashi Kashima, Jun Sese

Abstract:

Itemset mining and graph mining have attracted considerable attention in the field of data mining, since they have many important applications in various areas such as biology, marketing, and social network analysis. However, most existing studies focus only on either itemset mining or graph mining, and only a few studies have addressed a combination of both. In this paper, we introduce a new problem which we call itemset-sharing subgraph (ISS) set enumeration, where the task is to find sets of subgraphs with common itemsets in a large graph in which each vertex has an associated itemset. The problem has various interesting potential applications such as in side-effect analysis in drug discovery and the analysis of the influence of word-of-mouth communication in marketing in social networks. We propose an efficient algorithm ROBIN for finding ISS sets in such graph; this algorithm enumerates connected subgraphs having common itemsets and finds their combinations. Experiments using a synthetic network verify that our method can efficiently process networks with more than one million edges. Experiments using a real biological network show that our algorithm can find biologically interesting patterns. We also apply ROBIN to a citation network and find successful collaborative research works.

If you think of a set of properties, “itemset,” as a topic and an “itemset-sharing subgraph (ISS)” as a match/merging criteria, the relevance of this paper to topic maps becomes immediately obvious.

Useful both for discovery of topics in data sets and as part of processing a topic map.
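The core notion is easy to sketch: an itemset induces the subgraph of vertices whose itemsets contain it, and the connected components of that subgraph are candidate ISS patterns. Plain-Python BFS with invented data (ROBIN’s enumeration is far more involved):

```python
from collections import deque

edges = {("p1", "p2"), ("p2", "p3"), ("p4", "p5")}
vertex_items = {"p1": {"topicmaps", "merge"}, "p2": {"topicmaps", "merge"},
                "p3": {"merge"}, "p4": {"topicmaps"}, "p5": {"topicmaps"}}

def components_sharing(itemset):
    # Vertices whose itemsets contain the query itemset.
    nodes = {v for v, items in vertex_items.items() if itemset <= items}
    adj = {v: set() for v in nodes}
    for a, b in edges:
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:                  # BFS for one connected component
            v = queue.popleft()
            if v in comp:
                continue
            comp.add(v)
            queue.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

print(components_sharing({"topicmaps"}))  # e.g. [{'p1','p2'}, {'p4','p5'}]
```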

October 10, 2010

DPSP: Distributed Progressive Sequential Pattern Mining on the Cloud

Filed under: Data Mining,Hadoop,MapReduce,Pattern Recognition — Patrick Durusau @ 10:12 am

DPSP: Distributed Progressive Sequential Pattern Mining on the Cloud
Authors: Jen-Wei Huang, Su-Chen Lin, Ming-Syan Chen
Keywords: sequential pattern mining, period of interest (POI), customer transactions

Abstract:

The progressive sequential pattern mining problem has been discussed in previous research works. With the increasing amount of data, single processors struggle to scale up. Traditional algorithms running on a single machine may have scalability troubles. Therefore, mining progressive sequential patterns intrinsically suffers from the scalability problem. In view of this, we design a distributed mining algorithm to address the scalability problem of mining progressive sequential patterns. The proposed algorithm DPSP, standing for Distributed Progressive Sequential Pattern mining algorithm, is implemented on top of Hadoop platform, which realizes the cloud computing environment. We propose Map/Reduce jobs in DPSP to delete obsolete itemsets, update current candidate sequential patterns and report up-to-date frequent sequential patterns within each POI. The experimental results show that DPSP possesses great scalability and consequently increases the performance and the practicability of mining algorithms.

The phrase mining sequential patterns was coined in Mining Sequential Patterns, a paper by Rakesh Agrawal and Ramakrishnan Srikant that is cited by the authors of this paper.

The original research was to find patterns in customer transactions, which I suspect are important “subjects” for discovery and representation in commerce topic maps.

Distributed Knowledge Discovery with Non Linear Dimensionality Reduction

Filed under: Data Mining,Dimension Reduction,Heterogeneous Data,High Dimensionality — Patrick Durusau @ 9:43 am

Distributed Knowledge Discovery with Non Linear Dimensionality Reduction
Authors: Panagis Magdalinos, Michalis Vazirgiannis, Dialecti Valsamou
Keywords: distributed non linear dimensionality reduction, NLDR, distributed dimensionality reduction, DDR, distributed data mining, DDM, dimensionality reduction, DR, Distributed Isomap, D-Isomap, C-Isomap, L-Isomap

Abstract:

Data mining tasks results are usually improved by reducing the dimensionality of data. This improvement however is achieved harder in the case that data lay on a non linear manifold and are distributed across network nodes. Although numerous algorithms for distributed dimensionality reduction have been proposed, all assume that data reside in a linear space. In order to address the non-linear case, we introduce D-Isomap, a novel distributed non linear dimensionality reduction algorithm, particularly applicable in large scale, structured peer-to-peer networks. Apart from unfolding a non linear manifold, our algorithm is capable of approximate reconstruction of the global dataset at peer level a very attractive feature for distributed data mining problems. We extensively evaluate its performance through experiments on both artificial and real world datasets. The obtained results show the suitability and viability of our approach for knowledge discovery in distributed environments.

Topic map authors will face data mining in peer-to-peer networks sooner or later.

Not only is this a useful discussion of the issues, but the authors have also posted the source code and data sets used in the article:

http://www.db-net.aueb.gr/panagis/PAKDD2010/
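To see what D-Isomap distributes, the centralized algorithm is a few lines with scikit-learn (toy manifold, illustrative parameters):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# A classic non-linear manifold: points on a rolled-up 2-D sheet in 3-D.
X, _ = make_swiss_roll(n_samples=500, random_state=0)
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)   # (500, 2): the unrolled manifold
```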

October 6, 2010

Mining Historic Query Trails to Label Long and Rare Search Engine Queries

Filed under: Authoring Topic Maps,Data Mining,Entity Extraction,Search Engines,Searching — Patrick Durusau @ 7:05 am

Mining Historic Query Trails to Label Long and Rare Search Engine Queries
Authors: Peter Bailey, Ryen W. White, Han Liu, Giridhar Kumaran
Keywords: Long queries, query labeling

Abstract:

Web search engines can perform poorly for long queries (i.e., those containing four or more terms), in part because of their high level of query specificity. The automatic assignment of labels to long queries can capture aspects of a user’s search intent that may not be apparent from the terms in the query. This affords search result matching or reranking based on queries and labels rather than the query text alone. Query labels can be derived from interaction logs generated from many users’ search result clicks or from query trails comprising the chain of URLs visited following query submission. However, since long queries are typically rare, they are difficult to label in this way because little or no historic log data exists for them. A subset of these queries may be amenable to labeling by detecting similarities between parts of a long and rare query and the queries which appear in logs. In this article, we present the comparison of four similarity algorithms for the automatic assignment of Open Directory Project category labels to long and rare queries, based solely on matching against similar satisfied query trails extracted from log data. Our findings show that although the similarity-matching algorithms we investigated have tradeoffs in terms of coverage and accuracy, one algorithm that bases similarity on a popular search result ranking function (effectively regarding potentially-similar queries as “documents”) outperforms the others. We find that it is possible to correctly predict the top label better than one in five times, even when no past query trail exactly matches the long and rare query. We show that these labels can be used to reorder top-ranked search results leading to a significant improvement in retrieval performance over baselines that do not utilize query labeling, but instead rank results using content-matching or click-through logs. The outcomes of our research have implications for search providers attempting to provide users with highly-relevant search results for long queries.

(Apologies for repeating the long abstract but this needs wider notice.)

What the authors call “label prediction algorithms” is a step in mining data for subjects.

The research may also improve search results through the use of labels for ranking.
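A toy sketch of the similarity-matching idea: treat queries as term vectors, rank logged queries by cosine similarity to the long query, and borrow the best match’s label. The log and labels below are invented.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

logged = {  # query -> ODP-style label learned from its query trail
    "jaguar repair manual": "Recreation/Autos",
    "jaguar habitat rainforest": "Science/Biology",
}
long_query = "jaguar xj6 brake repair manual pdf"

vec = Counter(long_query.split())
best = max(logged, key=lambda q: cosine(vec, Counter(q.split())))
print(best, "->", logged[best])
```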

September 28, 2010

Mining Billion-node Graphs: Patterns, Generators and Tools

Filed under: Authoring Topic Maps,Data Mining,Graphs,Software,Subject Identity — Patrick Durusau @ 9:38 am

Mining Billion-node Graphs: Patterns, Generators and Tools
Author: Christos Faloutsos (CMU)

Presentation on the Pegasus (PEta GrAph mining System) project.

If you have large amounts of real world data and need some motivation, take a look at this presentation.

September 23, 2010

HUGO Gene Nomenclature Committee

Filed under: Bioinformatics,Biomedical,Data Mining,Entity Extraction,Indexing,Software — Patrick Durusau @ 8:32 am

HUGO Gene Nomenclature Committee, a committee assigning unique names to genes.

Become familiar with the HUGO site, then read: The success (or not) of HUGO nomenclature (Genome Biology, 2006).

Now read: Moara: a Java library for extracting and normalizing gene and protein mentions (BMC Bioinformatics 2010)

Q: How would you apply the techniques in the Moara article to build a topic map? Would you keep or discard normalization?

PS: Moara Project (software, etc.)

September 14, 2010

International Journal of Approximate Reasoning – Volume 51, Issue 8, October 2010

Filed under: Data Mining,Similarity,Subject Identity — Patrick Durusau @ 3:49 am

International Journal of Approximate Reasoning – Volume 51, Issue 8, October 2010 has a couple of items of interest:

September 9, 2010

Calibrated Leakage?

Filed under: Data Mining,Examples,Subject Identity,Topic Maps — Patrick Durusau @ 6:36 pm

Unlike leaks from a faucet, only some leaks from the Obama White House annoy the administration.

All administrations approve of their “leaks” and dislike unfavorable “leaks.” In either case, it is an information mapping issue.

First, people who have access to particular documents or facts become topics. Their known associates, from FBI background checks, Facebook pages, etc., also become topics. Form associations between them.

Second, phone traffic and visitor/day book log entries become topics and build associations with White House staff and their friends.

Third, documents with a high likelihood of containing “leakable” stories or facts are topics, with timed associations as they fan out across the staff.

Fourth, “leaks” in the media are captured as topics, particularly by time of disclosure, along with who reported them, etc.

No magic, just automating and making correlations between information and records that already exist in disparate forms.

A topic map enables estimates of how effectively approved “leaks” are propagating, or investigation of the sources of unapproved “leaks.”

Topic maps: calibrating leakage.
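For the curious, a toy sketch of the structure just described: people, documents, and disclosures as topics, with typed, time-stamped associations between them. Every name and date is invented.

```python
from dataclasses import dataclass, field

@dataclass
class TopicMap:
    topics: set = field(default_factory=set)
    associations: list = field(default_factory=list)

    def associate(self, kind, members, when=None):
        # Members play roles (person, document, ...) in a typed association.
        self.topics.update(members.values())
        self.associations.append((kind, members, when))

tm = TopicMap()
tm.associate("has-access", {"person": "staffer-17", "document": "memo-42"},
             when="2010-09-01")
tm.associate("appears-in", {"fact": "memo-42/para-3", "story": "press-story-9"},
             when="2010-09-08")

# Who had access to the leaked document before the story ran?
suspects = [m["person"] for kind, m, when in tm.associations
            if kind == "has-access" and m["document"] == "memo-42"
            and when < "2010-09-08"]
print(suspects)
```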

******
PS: There are defenses to highly correlated data gathering/analysis. Please inquire.

