Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

January 15, 2011

How To Model Search Term Data To Classify User Intent & Match Query Expectations – Post

Filed under: Authoring Topic Maps,Data Mining,Interface Research/Design,Search Data — Patrick Durusau @ 5:49 pm

How To Model Search Term Data To Classify User Intent & Match Query Expectations by Mark Sprague, courtesy of Searchengineland.com, is an interesting piece on the analysis of search data to extract user intent.

As interesting as that is, I think it could be used by topic map authors for a slightly different purpose.

What if we were to use search data to classify how users were seeking particular subjects?

That is, to mine search data for patterns of subject identification, which really isn’t all that different from deciding what product or service to market to a user.

As a matter of fact, I suspect that many of the tools used by marketers could be repurposed to develop subject identifications for non-marketing information systems.

Such as library catalogs or professional literature searches.

The latter is often pay-per-view, where maintaining high customer satisfaction means repeat business and word-of-mouth advertising.

I am sure there is already literature on this sort of mining of search data for subject identifications. If you have a pointer or two, please send them my way.

January 13, 2011

Scaling Jaccard Distance for Document Deduplication: Shingling, MinHash and Locality-Sensitive Hashing – Post

Filed under: Data Mining,Similarity — Patrick Durusau @ 5:42 am

Scaling Jaccard Distance for Document Deduplication: Shingling, MinHash and Locality-Sensitive Hashing

Bob Carpenter of the LingPipe Blog points out the treatment of Jaccard distance in Mining of Massive Datasets by Anand Rajaraman and Jeffrey D. Ullman.

Worth a close look.
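To get a feel for the technique before opening the book, here is a minimal Python sketch (mine, not the book’s code, with made-up example strings): shingle each document into overlapping character k-grams, compute the exact Jaccard similarity, then approximate it with a MinHash signature so that large collections don’t require full pairwise set intersections.

    import hashlib
    import random

    def shingles(text, k=4):
        """Overlapping character k-grams of a document."""
        return {text[i:i + k] for i in range(len(text) - k + 1)}

    def jaccard(a, b):
        """Exact Jaccard similarity of two shingle sets."""
        return len(a & b) / len(a | b)

    def minhash_signature(shingle_set, num_hashes=64, seed=42):
        """Summarize a set by the minimum of several salted hashes."""
        rng = random.Random(seed)
        salts = [rng.getrandbits(32) for _ in range(num_hashes)]
        return [min(int(hashlib.md5(f"{salt}:{s}".encode()).hexdigest(), 16)
                    for s in shingle_set)
                for salt in salts]

    def estimated_jaccard(sig_a, sig_b):
        """Fraction of matching signature positions estimates Jaccard similarity."""
        return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

    doc1 = "the quick brown fox jumps over the lazy dog"
    doc2 = "the quick brown fox jumped over the lazy dog"
    s1, s2 = shingles(doc1), shingles(doc2)
    print("exact:", jaccard(s1, s2))
    print("estimate:", estimated_jaccard(minhash_signature(s1), minhash_signature(s2)))

Locality-sensitive hashing is the final step in the post’s title: band the signatures so that only documents agreeing on at least one band become candidate pairs, and Jaccard distance (1 minus similarity) only gets computed for those.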

January 10, 2011

Large Scale Data Mining Using Genetics-Based Machine Learning

Filed under: Data Mining,Machine Learning — Patrick Durusau @ 3:16 pm

Large Scale Data Mining Using Genetics-Based Machine Learning Authors: Jaume Bacardit, Xavier Llorà

Tutorial on data mining with genetics-based machine learning algorithms.

Usual examples of exploding information from genetics to high energy physics.

While those are good examples, it really isn’t necessary to go there in order to get large scale data sets.

Imagine constructing a network for all the entities and their relationships in a single issue of the New York Times.

That data isn’t as easy to obtain or to process as genetic databases or results from the Large Hadron Collider.

But that is a question of ease of access and processing, not of whether the data is large scale.

The finance pages alone have listings for all the major financial institutions in the country. What about mapping their relationships to each other?

Or for that matter, mapping the phone calls, emails and other communications between the stock trading houses? Broken down by subjects discussed.

Important problems, as often as not, have data that is difficult to acquire. That doesn’t make them any less important.

January 9, 2011

Apache UIMA

Apache UIMA

From the website:

Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.

UIMA enables applications to be decomposed into components, for example “language identification” => “language specific segmentation” => “sentence boundary detection” => “entity detection (person/place names etc.)”. Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.

UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.

The UIMA project offers a number of annotators that produce structured information from unstructured texts.
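The decomposition into components is the part that matters for topic map pipelines. The following is not the UIMA API (which is Java/C++ with XML descriptor files); it is only a toy Python sketch of the pattern the quote describes: every component reads and enriches a shared analysis structure, and the framework owns the flow between them.

    class CAS:
        """Toy stand-in for UIMA's common analysis structure."""
        def __init__(self, text):
            self.text = text
            self.annotations = []  # (type, start, end) tuples

    def sentence_detector(cas):
        start = 0
        for i, ch in enumerate(cas.text):
            if ch == ".":
                cas.annotations.append(("Sentence", start, i + 1))
                start = i + 2

    def entity_detector(cas):
        # Crude placeholder: capitalized tokens become "Entity" annotations.
        pos = 0
        for token in cas.text.split():
            start = cas.text.index(token, pos)
            if token[0].isupper():
                cas.annotations.append(("Entity", start, start + len(token)))
            pos = start + len(token)

    pipeline = [sentence_detector, entity_detector]  # the framework manages this flow
    cas = CAS("Alice works for Acme in Boston. Bob does not.")
    for component in pipeline:
        component(cas)
    print(cas.annotations)

In UIMA itself each of those functions would be a Java or C++ annotator with its own XML metadata, which is what lets the framework wrap components as network services and replicate them across a cluster.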

If you are using UIMA as a framework for developing topic maps, please post about your experiences. What works, what doesn’t, etc.

January 7, 2011

openNLP

Filed under: Data Mining,Natural Language Processing — Patrick Durusau @ 2:13 pm

openNLP

From the website:

OpenNLP is an organizational center for open source projects related to natural language processing. Its primary role is to encourage and facilitate the collaboration of researchers and developers on such projects.

OpenNLP also hosts a variety of java-based NLP tools which perform sentence detection, tokenization, pos-tagging, chunking and parsing, named-entity detection, and coreference using the OpenNLP Maxent machine learning package.

OpenNLP is incubating at the Apache Software Foundation (ASF).

Another set of NLP tools for topic map authors.

Apache OODT – Top Level Project

Filed under: Data Integration,Data Mining,Data Models,OODT,Software — Patrick Durusau @ 6:02 am

Apache OODT is the first NASA-developed software project to achieve ASF Top Level Project status.

From the website:

Just what is Apache™ OODT?

It’s metadata for middleware (and vice versa):

  • Transparent access to distributed resources
  • Data discovery and query optimization
  • Distributed processing and virtual archives

But it’s not just for science! It’s also a software architecture:

  • Models for information representation
  • Solutions to knowledge capture problems
  • Unification of technology, data, and metadata

Looks like a project that could benefit from having topic maps as part of its tool kit.

Check out the 0.1 OODT release and see what you think.

Apache Mahout – Data Mining Class

Filed under: Data Mining,Mahout — Patrick Durusau @ 5:27 am

Apache Mahout – Data Mining Class at the Illinois Institute of Technology, by Dr. David Grossman.

Grossman is the co-author of: Information Retrieval: Algorithms and Heuristics (The Information Retrieval Series)(2nd Edition)

The class was organized by Grant Ingersoll, see: Apache Mahout Catching on in Academia.

Definitely worth a visit to round out your data mining skills.

January 3, 2011

The Dow Piano

Filed under: Data Mining,Humor — Patrick Durusau @ 5:32 pm

The Dow Piano

A representation of a year of the Dow Jones Industrial Average as a graph and as musical notes.

At first I was going to post it as something too bizarre to pass up.

But, then it occurred to me that representing data in unexpected ways, such as musical notes, could be an interesting way to explore data.

I am not promising that you will find anything based upon converting stock trades from the various houses into musical notes. But you won’t know unless you look.
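As a hedged sketch of the idea (my own toy mapping, not how the Dow Piano was actually built): scale each day’s closing value into a MIDI note number, so rising prices become rising pitches.

    def to_midi_notes(closes, low_note=48, high_note=84):
        """Map a series of closing prices onto a MIDI note range."""
        lo, hi = min(closes), max(closes)
        span = (hi - lo) or 1.0
        return [round(low_note + (c - lo) / span * (high_note - low_note))
                for c in closes]

    # Made-up closing values, for illustration only.
    print(to_midi_notes([10428.0, 10452.3, 10398.1, 10520.7, 10572.9]))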

What will be amusing is that, if and when patterns are found, it will be like the rabbit/duck image: it won’t be possible to see the data without seeing the pattern.

Modeling Social Annotation: A Bayesian Approach

Filed under: Bayesian Models,Data Mining,Tagging — Patrick Durusau @ 2:46 pm

Modeling Social Annotation: A Bayesian Approach Authors: Anon Plangprasopchok, Kristina Lerman

Abstract:

Collaborative tagging systems, such as Delicious, CiteULike, and others, allow users to annotate resources, for example, Web pages or scientific papers, with descriptive labels called tags. The social annotations contributed by thousands of users can potentially be used to infer categorical knowledge, classify documents, or recommend new relevant information. Traditional text inference methods do not make the best use of social annotation, since they do not take into account variations in individual users’ perspectives and vocabulary. In a previous work, we introduced a simple probabilistic model that takes the interests of individual annotators into account in order to find hidden topics of annotated resources. Unfortunately, that approach had one major shortcoming: the number of topics and interests must be specified a priori. To address this drawback, we extend the model to a fully Bayesian framework, which offers a way to automatically estimate these numbers. In particular, the model allows the number of interests and topics to change as suggested by the structure of the data. We evaluate the proposed model in detail on the synthetic and real-world data by comparing its performance to Latent Dirichlet Allocation on the topic extraction task. For the latter evaluation, we apply the model to infer topics of Web resources from social annotations obtained from Delicious in order to discover new resources similar to a specified one. Our empirical results demonstrate that the proposed model is a promising method for exploiting social knowledge contained in user-generated annotations.
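For the LDA baseline the authors compare against (not their fully Bayesian model), here is a minimal sketch using gensim, which is my choice of toolkit rather than anything the paper prescribes, treating the bag of tags on each resource as a document:

    from gensim import corpora, models

    # Each "document" is the set of tags users attached to one resource
    # (made-up data, for illustration only).
    tagged_resources = [
        ["python", "nlp", "tutorial", "programming"],
        ["bayesian", "statistics", "inference", "paper"],
        ["python", "programming", "scripting"],
        ["statistics", "mcmc", "bayesian"],
    ]

    dictionary = corpora.Dictionary(tagged_resources)
    corpus = [dictionary.doc2bow(tags) for tags in tagged_resources]

    # LDA needs the number of topics up front -- exactly the limitation
    # the paper's fully Bayesian extension is meant to remove.
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
    for topic in lda.print_topics():
        print(topic)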

Questions:

  1. How does a tagging vocabulary differ (if it does) from a regular vocabulary? (3-5 pages, no citations)
  2. Would this technique be applicable to tracing vocabulary usage across cited papers? In other words, following an author backwards through materials they cite? (3-5 pages, no citations)
  3. What other characteristics do you think a paper would have where the usage of a term had shifted to a different meaning? (3-5 pages, no citations)

December 30, 2010

Mining Travel Resources on the Web Using L-Wrappers

Filed under: Data Mining,Inductive Logic Programming (ILP),L-wrappers — Patrick Durusau @ 7:57 am

Mining Travel Resources on the Web Using L-Wrappers Authors: Elvira Popescu, Amelia Bădică, and Costin Bădică

Abstract:

The work described here is part of an ongoing research on the application of general-purpose inductive logic programming, logic representation of wrappers (L-wrappers) and XML technologies (including the XSLT transformation language) to information extraction from the Web. The L-wrappers methodology is based on a sound theoretical approach and has already proved its efficacy on a smaller scale, in the area of collecting product information. This paper proposes the use of L-wrappers for tuple extraction from HTML in the domain of e-tourism. It also describes a method for translating L-wrappers into XSLT and illustrates it with the example of a real-world travel agency Web site.

Deeply interesting work in part due to the use of XSLT to extract tuples from HTML pages but also because a labeled ordered tree is used as an interpretive domain for patterns matched against the tree.

If the latter sounds familiar, it should: most data mining techniques specify a domain in which results (intermediate or otherwise) are going to be interpreted.

I will look around for other material on L-wrappers and inductive logic programming.

December 29, 2010

Setting Government Data Free With ScraperWiki

Filed under: Data Mining,Data Source — Patrick Durusau @ 2:12 pm

Setting Government Data Free With ScraperWiki

reports on a video by Max Ogden illustrating the use of ScraperWiki to harvest government data.

If you are planning on adding government data to your topic map, this is a video you need to see.

December 27, 2010

Python Text Processing with NLTK 2.0 Cookbook – Review Forthcoming!

Filed under: Classification,Data Analysis,Data Mining,Natural Language Processing — Patrick Durusau @ 2:25 pm

Just a quick placeholder to say that I am reviewing Python Text Processing with NLTK 2.0 Cookbook.


I should have the review done in the next couple of weeks.

In the longer term I will be developing a set of notes on the construction of topic maps using this toolkit.
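While the review is pending, here is a minimal sketch of the kind of NLTK step that feeds topic map construction (my own example, not code from the book, and it assumes the standard tokenizer, tagger, and chunker models have been downloaded): tokenize, tag, and pull out named entities as candidate topics.

    import nltk

    # Assumes the standard NLTK tokenizer, tagger, and named-entity chunker
    # models have already been fetched via nltk.download().
    text = "Patrick Durusau writes about topic maps and the Apache Software Foundation."
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(tagged)

    # Each named-entity subtree is a candidate topic for a map.
    candidates = [" ".join(word for word, tag in subtree.leaves())
                  for subtree in tree.subtrees()
                  if subtree.label() in ("PERSON", "ORGANIZATION", "GPE")]
    print(candidates)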

While you wait for the review, you might enjoy reading: Chapter No.3 – Creating Custom Corpora (free download).

December 23, 2010

ScraperWiki

Filed under: Authoring Topic Maps,Data Mining,Data Source — Patrick Durusau @ 1:47 pm

ScraperWiki

The website describes traditional screen scraping and then says:

ScraperWiki is an online tool to make that process simpler and more collaborative. Anyone can write a screen scraper using the online editor, and the code and data are shared with the world. Because it’s a wiki, other programmers can contribute to and improve the code. And, if you’re not a programmer yourself, you can request a scraper or ask the ScraperWiki team to write one for you.

Interesting way to promote the transition to accessible and structured data.

One step closer to data being incorporated into, or viewed through, a topic map!
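For a taste of what a scraper actually does, here is a bare-bones example using only the Python standard library, not ScraperWiki’s online editor (the URL and table layout are made up): fetch a page and turn an HTML table into structured rows.

    from html.parser import HTMLParser
    from urllib.request import urlopen

    class TableScraper(HTMLParser):
        """Collects the text of each <td>/<th> cell, row by row."""
        def __init__(self):
            super().__init__()
            self.rows, self.row, self.in_cell = [], [], False

        def handle_starttag(self, tag, attrs):
            if tag == "tr":
                self.row = []
            elif tag in ("td", "th"):
                self.in_cell = True

        def handle_endtag(self, tag):
            if tag == "tr" and self.row:
                self.rows.append(self.row)
            elif tag in ("td", "th"):
                self.in_cell = False

        def handle_data(self, data):
            if self.in_cell and data.strip():
                self.row.append(data.strip())

    # Hypothetical URL for a government page containing an HTML table.
    html = urlopen("http://example.gov/spending.html").read().decode("utf-8", "replace")
    scraper = TableScraper()
    scraper.feed(html)
    print(scraper.rows[:5])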

December 18, 2010

Data trails reconstruction at the community level in the Web of data – Presentation

Filed under: Co-Words,Data Mining,Subject Identity — Patrick Durusau @ 9:30 am

David Chavalarias: Video from SOKS: Self-Organising Knowledge Systems, Amsterdam, 29 April 2010.

Abstract:

Socio-semantic networks continuously produce data over the Web in a time consistent manner. From scientific communities publishing new findings in archives to citizens confronting their opinions in blogs, there is a real challenge to reconstruct, at the community level, the data trails they produce in order to have a global representation of the topics unfolding in these public arena. We will present such methods of reconstruction in the framework of co-word analysis, highlighting perspectives for the development of innovative tools for our daily interactions with their productions.

I wasn’t able to get very good sound quality for this presentation and there were no slides. However, I was interested enough to find the author’s home page: David Chavalarias and a wealth of interesting material.

I will be watching his projects for some very interesting results and suggest that you do the same.

December 17, 2010

Data Driven Journalism

Filed under: Data Mining,R — Patrick Durusau @ 2:15 pm

Data Driven Journalism

Report of a presentation by Peter Aldhous, San Francisco Bureau Chief of New Scientist magazine, to the Bay Area R user group.

The main focus is on the use of data in journalism, with coverage of the use of R.

Needs only minor tweaking to make an excellent case for topic maps in journalism.

December 14, 2010

Duplicate and Near Duplicate Documents Detection: A Review

Filed under: Data Mining,Duplicates — Patrick Durusau @ 7:24 am

Duplicate and Near Duplicate Documents Detection: A Review Authors: J Prasanna Kumar, P Govindarajulu Keywords: Web Mining, Web Content Mining, Web Crawling, Web pages, Duplicate Document, Near duplicate pages, Near duplicate detection

Abstract:

The development of Internet has resulted in the flooding of numerous copies of web documents in the search results making them futilely relevant to the users thereby creating a serious problem for internet search engines. The outcome of perpetual growth of Web and e-commerce has led to the increase in demand of new Web sites and Web applications. Duplicated web pages that consist of identical structure but different data can be regarded as clones. The identification of similar or near-duplicate pairs in a large collection is a significant problem with wide-spread applications. The problem has been deliberated for diverse data types (e.g. textual documents, spatial points and relational records) in diverse settings. Another contemporary materialization of the problem is the efficient identification of near-duplicate Web pages. This is certainly challenging in the web-scale due to the voluminous data and high dimensionalities of the documents. This survey paper has a fundamental intention to present an up-to-date review of the existing literature in duplicate and near duplicate detection of general documents and web documents in web crawling. Besides, the classification of the existing literature in duplicate and near duplicate detection techniques and a detailed description of the same are presented so as to make the survey more comprehensible. Additionally a brief introduction of web mining, web crawling, and duplicate document detection are also presented.

Questions:

Duplicate document detection is a rapidly evolving field.

  1. What general considerations would govern a topic map to remain current in this field?
  2. What would we need to extract from this paper to construct such a map?
  3. What other technologies would we need to use in connection with such a map?
  4. What data sources should we use for such a map?

December 13, 2010

Machine Learning and Data Mining with R – Post

Filed under: Bayesian Models,Data Mining,R — Patrick Durusau @ 7:58 pm

Machine Learning and Data Mining with R

Announcement of course notes and slides, plus live classes in San Francisco, January 2012, courtesy of the Revolutions blog from Revolution Analytics.

Check the post for details and links.

December 12, 2010

Szl – A Compiler and Runtime for the Sawzall Language

Filed under: Data Mining,Software — Patrick Durusau @ 5:52 pm

Szl – A Compiler and Runtime for the Sawzall Language

From the website:

Szl is a compiler and runtime for the Sawzall language. It includes support for statistical aggregation of values read or computed from the input. Google uses Sawzall to process log data generated by Google’s servers.

Since a Sawzall program processes one record of input at a time and does not preserve any state (values of variables) between records, it is well suited for execution as the map phase of a map-reduce. The library also includes support for the statistical aggregation that would be done in the reduce phase of a map-reduce.
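The record-at-a-time, no-state-between-records constraint is what makes Sawzall map-friendly. This is not Szl or Sawzall itself, just a small Python sketch of the same shape: a map step that emits values per record and an aggregation step that combines them.

    from collections import Counter

    # Made-up log records, for illustration only.
    records = [
        {"status": 200, "bytes": 512},
        {"status": 404, "bytes": 0},
        {"status": 200, "bytes": 2048},
    ]

    def map_record(record):
        """Processes exactly one record, keeps no state, emits (table, key, value)."""
        yield ("hits_by_status", record["status"], 1)
        yield ("total_bytes", None, record["bytes"])

    def aggregate(emitted):
        """The reduce side: statistical aggregation over everything emitted."""
        hits, total_bytes = Counter(), 0
        for table, key, value in emitted:
            if table == "hits_by_status":
                hits[key] += value
            else:
                total_bytes += value
        return hits, total_bytes

    emitted = [item for r in records for item in map_record(r)]
    print(aggregate(emitted))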

The reading of one record at a time reminds me of the record linkage work that was developed in the late 1950’s in medical epidemiology.

Of course, there the records were converted into a uniform presentation, losing their original equivalents to column headers, etc. So the technique began with semantic loss.

I suppose you could say it was a lossy semantic integration technique.

Of course, that’s true for any semantic integration technique that doesn’t preserve the original language of a data set.

I will have to dig out some record linkage software to compare to Szl.

December 11, 2010

Accidental Complexity

Filed under: Clojure,Data Mining,Software — Patrick Durusau @ 3:22 pm

Nathan Marz, in Clojure at BackType, uses the term accidental complexity.

accidental complexity: Complexity caused by the tool to solve a problem rather than the problem itself

According to Nathan, Clojure helps avoid accidental complexity, something that would be useful in any semantic integration system.

The presentation is described as:

Clojure has led to a significant reduction in complexity in BackType’s systems. BackType uses Clojure all over the backend, from processing data on Hadoop to a custom database to realtime workers. In this talk Nathan will give a crash course on Clojure and using it to build data-driven systems.

Very much worth the time to view it, even more than once.

December 9, 2010

Mining of Massive Datasets – eBook

Mining of Massive Datasets

Jeff Dalton of Jeff’s Search Engine Caffè reports a new data mining book by Anand Rajaraman and Jeffrey D. Ullman (yes, that Jeffrey D. Ullman, think “dragon book”).

A free eBook no less.

Read Jeff’s post on your way to get a copy.

Look for more comments as I read through it.

Has anyone written a comparison of the recent search engine titles? Just curious.


Update: New version out in hard copy and e-book remains available. See: Mining Massive Data Sets – Update

December 6, 2010

A Brief Survey on Sequence Classification

Filed under: Data Mining,Pattern Recognition,Sequence Classification,Subject Identity — Patrick Durusau @ 5:56 am

A Brief Survey on Sequence Classification Authors: Zhengzheng Xing, Jian Pei, Eamonn Keogh

Abstract:

Sequence classification has a broad range of applications such as genomic analysis, information retrieval, health informatics, finance, and abnormal detection. Different from the classification task on feature vectors, sequences do not have explicit features. Even with sophisticated feature selection techniques, the dimensionality of potential features may still be very high and the sequential nature of features is difficult to capture. This makes sequence classification a more challenging task than classification on feature vectors. In this paper, we present a brief review of the existing work on sequence classification. We summarize the sequence classification in terms of methodologies and application domains. We also provide a review on several extensions of the sequence classification problem, such as early classification on sequences and semi-supervised learning on sequences.

Excellent survey article on sequence classification, which as the authors note, is a rapidly developing field of research.
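One of the feature-based approaches the survey covers can be sketched quickly: turn each sequence into character k-gram counts and hand those to an off-the-shelf classifier. The scikit-learn calls and the toy sequences below are my own illustration, not code from the paper.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Toy DNA-like sequences with made-up class labels.
    sequences = ["ATGCCGTA", "ATGCCGTT", "GGCATTCA", "GGCATACA"]
    labels = ["coding", "coding", "noncoding", "noncoding"]

    # Explicit features: overlapping character 3-grams for each sequence.
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
    X = vectorizer.fit_transform(sequences)

    clf = MultinomialNB()
    clf.fit(X, labels)
    print(clf.predict(vectorizer.transform(["ATGCCGAA"])))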

This article was published in the “newsletter” of the ACM Special Interest Group on Knowledge Discovery and Data Mining. Far more substantive material than I am accustomed to seeing in any “newsletter.”

The ACM has very attractive student discounts and if you are serious about being an information professional, it is one of the organizations that I would recommend in addition to the usual library suspects.

December 4, 2010

Exploring Homology Using the Concept of Three-State Entropy Vector

Filed under: Bioinformatics,Biomedical,Data Mining — Patrick Durusau @ 3:24 pm

Exploring Homology Using the Concept of Three-State Entropy Vector Authors: Armando J. Pinho, Sara P. Garcia, Paulo J. S. G. Ferreira, Vera Afreixo, Carlos A. C. Bastos, António J. R. Neves, João M. O. S. Rodrigues Keywords: DNA signature, DNA coding regions, DNA entropy, Markov models

Abstract:

The three-base periodicity usually found in exons has been used for several purposes, as for example the prediction of potential genes. In this paper, we use a data model, previously proposed for encoding protein-coding regions of DNA sequences, to build signatures capable of supporting the construction of meaningful dendograms. The model relies on the three-base periodicity and provides an estimate of the entropy associated with each of the three bases of the codons. We observe that the three entropy values vary among themselves and also from species to species. Moreover, we provide evidence that this makes it possible to associate a three-state entropy vector with each species, and we show that similar species are characterized by similar three-state entropy vectors.

I include this paper both as informative for the bioinformatics crowd and as an illustration that subject identity tests are as varied as the subjects they identify. In this particular case, the identification of species for the construction of dendrograms.
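A minimal sketch of the three-state entropy idea, based on my reading of the abstract rather than the authors’ full model (which also uses Markov context): split a coding sequence into codons and compute the base entropy separately at each of the three codon positions.

    import math
    from collections import Counter

    def three_state_entropy(dna):
        """Entropy (bits) of the base distribution at each codon position."""
        positions = [Counter(), Counter(), Counter()]
        usable = dna[:len(dna) - len(dna) % 3]
        for i, base in enumerate(usable):
            positions[i % 3][base] += 1
        vector = []
        for counts in positions:
            total = sum(counts.values())
            vector.append(-sum((n / total) * math.log2(n / total)
                               for n in counts.values()))
        return vector

    # Made-up coding sequence, for illustration only.
    print(three_state_entropy("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"))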

December 3, 2010

S4

S4

From the website:

S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.

Just in case you were wondering if topic maps are limited to being bounded objects composed of syntax. No.
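Not S4 itself (which is a distributed Java platform), but a toy Python sketch of its “processing element” idea: consume an unbounded stream one event at a time and maintain keyed state, with no assumption that the stream ever ends.

    import itertools
    import random

    def click_stream():
        """Stands in for an unbounded stream of events."""
        pages = ["/home", "/search", "/topic-maps"]
        while True:
            yield {"page": random.choice(pages)}

    def counting_pe(stream, report_every=1000):
        """A processing element: keyed counts over a stream that never ends."""
        counts = {}
        for i, event in enumerate(stream, 1):
            counts[event["page"]] = counts.get(event["page"], 0) + 1
            if i % report_every == 0:
                print(i, counts)

    # Bound the demo artificially; a real deployment would just keep consuming.
    counting_pe(itertools.islice(click_stream(), 3000))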

Questions:

  1. Specify three sources of unbounded streams of data. (3 pages, citations)
  2. What subjects would you want to identify and on what basis in any one of them? (3-5 pages, citations)
  3. What other information about those subjects would you want to bind to the information in #2? What subject identity tests are used for those subjects in other sources? (5-10 pages, citations)

December 2, 2010

Apache Tika – a content analysis toolkit

Filed under: Authoring Topic Maps,Data Mining,Software — Patrick Durusau @ 7:57 pm

Apache Tika – a content analysis toolkit

From the website:

Apache Tika™ is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Formats include:

  • HyperText Markup Language
  • XML and derived formats
  • Microsoft Office document format
  • OpenDocument Format
  • Portable Document Format
  • Electronic Publication Format
  • Rich Text Format
  • Compression and packaging formats
  • Text formats
  • Audio formats
  • Image formats
  • Video formats
  • Java class files and archives
  • The mbox format

Sounds like we are getting close to pipelines for topic map production.
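As a hedged sketch of what such a pipeline might look like, using the third-party tika-python bindings (an assumption on my part, since the project itself is a Java toolkit) to pull text and metadata that a downstream topic map step could consume:

    from tika import parser  # third-party bindings; assumes a Tika server or jar is reachable

    def extract(path):
        """Pull metadata and plain text out of an arbitrary document."""
        parsed = parser.from_file(path)
        return parsed.get("metadata", {}), parsed.get("content") or ""

    # Hypothetical file name; any of the formats listed above would do.
    metadata, text = extract("annual-report.pdf")
    print(metadata.get("Content-Type"), len(text))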

Comments?

November 30, 2010

Apache Mahout – Website

Filed under: Classification,Clustering,Data Mining,Mahout,Pattern Recognition,Software — Patrick Durusau @ 8:54 pm

Apache Mahout

From the website:

Apache Mahout’s goal is to build scalable machine learning libraries. With scalable we mean:

Scalable to reasonably large data sets. Our core algorithms for clustering, classfication and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However we do not restrict contributions to Hadoop based implementations: Contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms.

Current capabilities include:

  • Collaborative Filtering
  • User and Item based recommenders
  • K-Means, Fuzzy K-Means clustering
  • Mean Shift clustering
  • Dirichlet process clustering
  • Latent Dirichlet Allocation
  • Singular value decomposition
  • Parallel Frequent Pattern mining
  • Complementary Naive Bayes classifier
  • Random forest decision tree based classifier
  • High performance java collections (previously colt collections)

A topic maps class will only have enough time to show some examples of using Mahout. Perhaps an informal group?
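Mahout’s value is running these algorithms at Hadoop scale, but the algorithms themselves are small enough to demo. Here is a toy, single-node k-means in Python (not Mahout code) for anyone who wants to see what the K-Means entry in the list above is doing under the hood:

    import random

    def kmeans(points, k, iterations=20, seed=0):
        """Plain Lloyd's algorithm on 2-D points."""
        random.seed(seed)
        centroids = random.sample(points, k)
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k),
                              key=lambda i: (p[0] - centroids[i][0]) ** 2 +
                                            (p[1] - centroids[i][1]) ** 2)
                clusters[nearest].append(p)
            centroids = [
                (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                if c else centroids[i]
                for i, c in enumerate(clusters)]
        return centroids

    # Two made-up blobs of points.
    points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.2), (5.3, 4.8), (4.9, 5.1)]
    print(kmeans(points, k=2))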

RDF Extension for Google Refine

Filed under: Data Mining,RDF,Software — Patrick Durusau @ 1:09 pm

RDF Extension for Google Refine

From the website:

This project adds a graphical user interface(GUI) for exporting data of Google Refine projects in RDF format. The export is based on mapping the data to a template graph using the GUI.

See my earlier post on Google Refine 2.0.

BTW, if you don’t know the folks at DERI – Digital Enterprise Research Institute – take a few minutes (it will stretch into hours) to explore their many projects. (I will be doing a separate post on projects of particular interest for topic maps from DERI soon.)

November 26, 2010

The Data Science Venn Diagram – Post

Filed under: Data Mining,Education,Humor — Patrick Durusau @ 1:41 pm

The Data Science Venn Diagram by Drew Conway is a must see!

Not only is it amusing but a good way to judge the skill set needed for data science.

Balisage Contest:

Print this in color for Balisage next year.

Put diagrams on both sides of bulletin board.

Contestant and colleague have to mark the location of the contestant at the same time.

Results displayed to audience. 😉

Person who comes closest to matching the colleague’s evaluation wins a prize (to be determined).

November 25, 2010

LingPipe Blog

Filed under: Data Mining,Natural Language Processing,Text Analytics — Patrick Durusau @ 11:07 am

LingPipe Blog: Natural Language Processing and Text Analytics

Blog for the LingPipe Toolkit.

If you want to move beyond hand-authored topic maps, NLP and other techniques are in your future.

Imagine using LingPipe to generate entity profiles that you then edit (or not) and market for particular data resources.

On entity profiles, see: Sig.ma.

November 24, 2010

R-Bloggers

Filed under: Data Mining,R — Patrick Durusau @ 4:47 pm

R-Bloggers: R news contributed by (~130) R bloggers

R is a “software environment for statistical computing and graphics” (so the project page says).

Used extensively in a number of fields for data mining, exploration and display.

R-Bloggers, as the name implies, is a blog site contributed to by a number of R users.

Questions:

  1. Bibliography of use of R in library projects.
  2. Use R for exploring data set for building a topic map. (Project)
  3. Bibliography of use of R in particular subject area.