Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

December 25, 2010

The Wavelet Tutorial

Filed under: Information Retrieval,Wavelet Transforms — Patrick Durusau @ 6:46 am

The Wavelet Tutorial

As its name implies, a tutorial on wavelet transformation.

It’s not often that you see engineers treated as non-math types, but this tutorial was written from an engineering perspective and not for “math people.” (The author’s term, not mine.)

More accessible than many of the wavelet transformation explanations I have seen, so I mention it here.
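For a quick, hands-on feel for what the tutorial covers, here is a minimal sketch of a single-level discrete wavelet transform, assuming the PyWavelets (pywt) package is installed. The signal and the wavelet choice are my own illustrations, not taken from the tutorial.

```python
import pywt

# A toy signal; any 1-D sequence will do.
signal = [3.0, 7.0, 1.0, 1.0, -2.0, 5.0, 4.0, 6.0]

# Single-level discrete wavelet transform with the Haar ('db1') wavelet.
approx, detail = pywt.dwt(signal, 'db1')

print("approximation coefficients:", approx)
print("detail coefficients:", detail)

# The transform is invertible: idwt reconstructs the original signal.
print("reconstructed:", pywt.idwt(approx, detail, 'db1'))
```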

Questions:

  1. What improvements would you suggest for this tutorial? (1-2 pages, no citations)
  2. What examples would you add to make it more relevant to information retrieval? (1-2 pages, no citations)
  3. Other wavelet tutorials that you have found helpful? (1-2 pages, citations/links)

Spectral Based Information Retrieval

Filed under: Information Retrieval,Retrieval,TREC,Vectors,Wavelet Transforms — Patrick Durusau @ 6:10 am

Spectral Based Information Retrieval Author: Laurence A. F. Park (2003)

Every now and again I run into a dissertation that is an interesting and useful survey of a field and an original contribution to the literature.

Not often but it does happen.

It happened in this case with Park’s dissertation.

The beginning of an interesting thread of research that treats terms in a document as a spectrum and then applies spectral transformations to the retrieval problem.

The technique has been developed and extended since the appearance of Park’s work.

Highly recommended, particularly if you are interested in tracing the development of this technique in information retrieval.

My interest is in the use of spectral representations of text in information retrieval as part of topic map authoring and its potential as a subject identity criterion.

Actually I should broaden that to include retrieval of images and other data as well.
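To make the “terms as spectra” idea concrete, here is a toy sketch (my illustration, not Park’s method): treat a term’s positions in a document as a signal and look at its frequency spectrum, assuming NumPy is available.

```python
import numpy as np

tokens = "the cat sat on the mat while the other cat slept".split()
term = "cat"

# Positional signal for the term: 1 where it occurs, 0 elsewhere.
signal = np.array([1.0 if t == term else 0.0 for t in tokens])

# Spectrum of the term's positional signal; the magnitudes summarize
# how the term is distributed across the document.
spectrum = np.fft.rfft(signal)
print(np.abs(spectrum))
```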

Questions:

  1. Prepare an annotated bibliography of ten (10) recent papers using spectral analysis for information retrieval.
  2. Spectral analysis helps retrieve documents but what if you are searching for ideas? Does spectral analysis offer any help?
  3. How would you extend the current state of spectral based information retrieval? (5-10 pages, project proposal, citations)

December 24, 2010

The OpenLink Data Explorer Extension

Filed under: Linked Data — Patrick Durusau @ 5:03 pm

The OpenLink Data Explorer Extension

Extensions for Firefox, Safari, Google Chrome to explore data underlying web pages.

Must be my lack of business experience but I would think a browser extension would start with the most widely used browser.

As for the claim that Linked Data supports the combination of heterogeneous data without programming, I suppose that is true.

That is, heterogeneous data can be combined using Linked Data; whether the result is meaningful is an entirely different question.

There are some SW based efforts to make such combinations more likely to be useful. More on those anon.

idMesh: Graph-Based Disambiguation of Linked Data

Filed under: Entity Resolution,Linked Data,Topic Maps — Patrick Durusau @ 9:21 am

idMesh: Graph-Based Disambiguation of Linked Data Authors: Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer

Abstract:

We tackle the problem of disambiguating entities on the Web. We propose a user-driven scheme where graphs of entities – represented by globally identifiable declarative artifacts – self-organize in a dynamic and probabilistic manner. Our solution has the following two desirable properties: i) it lets end-users freely define associations between arbitrary entities and ii) it probabilistically infers entity relationships based on uncertain links using constraint-satisfaction mechanisms. We outline the interface between our scheme and the current data Web, and show how higher-layer applications can take advantage of our approach to enhance search and update of information relating to online entities. We describe a decentralized infrastructure supporting efficient and scalable entity disambiguation and demonstrate the practicability of our approach in a deployment over several hundreds of machines.

Interesting paper, but disappointing in that links are the only mechanism offered for indicating equivalence of entities.

While that is true for Linked Data and the Semantic Web in general (see owl:sameAs), topic maps have long supported a more robust, declarative approach.

The Topic Maps Data Model (TMDM) defines default merging for topics, but leaves open the specification of additional bases for merging.

The Topic Maps Reference Model (TMRM) does not define any merging rules at all but enables legends to make their own choices with regard to bases for merging.

The problem engendered by indicating equivalence by use of IRIs is just that: all you have is an indication of equivalence.

There is no indication of why, on what basis, etc., two (or more) IRIs are thought to indicate the same subject.

Which means there is no basis on which to compare them with other representatives for the same subject.

As well as no basis for perhaps re-using that declaration of equivalence.
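A rough sketch of the difference, using invented IRIs and properties: when an equivalence assertion carries the basis on which it was made, assertions can be compared with other representatives of the same subject and reused; a bare sameAs-style link gives us nothing to compare.

```python
# Hypothetical equivalence assertions that record *why* two IRIs are
# thought to identify the same subject (all IRIs and properties invented).
equivalences = [
    {"iris": ("http://example.org/a#Paris", "http://example.net/b#Paris"),
     "basis": {"property": "geo:coordinates", "value": "48.8566,2.3522"}},
    {"iris": ("http://example.org/a#Paris", "http://example.com/c#Paris_TX"),
     "basis": {"property": "rdfs:label", "value": "Paris"}},
]

def comparable(e1, e2):
    """Two assertions can be meaningfully compared (or reused against
    another representative of the same subject) only if they declare
    the same kind of basis; a bare sameAs link declares none."""
    return e1["basis"]["property"] == e2["basis"]["property"]

print(comparable(equivalences[0], equivalences[1]))  # False: different bases
```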

December 23, 2010

U.S. SEC RSS Feeds

Filed under: Data Source — Patrick Durusau @ 3:17 pm

U.S. SEC RSS Feeds

I ran across these feeds while looking at EDGAR Guide.

Should the SEC succeed in exposing EDGAR with an API, that would be useful for combining financial industry data.

Question: Is anyone capturing the current RSS feeds into a topic map application?
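In case anyone wants to try, here is a minimal sketch of capturing entries from one of the feeds, assuming the feedparser library is installed and using a placeholder feed URL (substitute an actual SEC feed address).

```python
import feedparser

# Placeholder URL; replace with a real SEC EDGAR RSS/Atom feed address.
FEED_URL = "https://www.sec.gov/path/to/some-edgar-feed.xml"

feed = feedparser.parse(FEED_URL)

# Each entry could become (or update) a topic in a topic map application.
for entry in feed.entries:
    print(entry.get("title"), entry.get("link"), entry.get("published"))
```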

Mergent Corporate Actions and Dividends API

Filed under: Data Source — Patrick Durusau @ 3:05 pm

Mergent Corporate Actions and Dividends API

From http://www.programmableweb.com:

Provides information on corporate actions and events reported by US and Canadian publicly traded companies, including a detailed database of issued and declared dividends. The API allows access to detailed data on dividend distributions, stock splits, stock dividends, spin-offs, redemption of stock, rights, tender offers, mergers & acquisitions, bankruptcy filings, and more.

Another corporate finance information source that can be usefully combined with other information.

Speller Challenge II

Filed under: Marketing — Patrick Durusau @ 1:56 pm

After posting Speller Challenge, it occurred to me that the contest name is misleading.

It really isn’t a speller contest as much as it is a spelling-check contest.

That is, a speller implies being able to correctly spell words in some language, a semester NLP or AI type project.

What is needed for this contest is a spelling checker that recognizes likely completions and reports, for any given completion, all the variant spellings.

Deriving that solution would have two parts:

First, data mining to determine all the variants (and their frequency) for any given completion. With search logs it should be possible to keep track of variants by locale but the contest did not ask for that. Save that for a future refinement.

Second, the topic map part would be to represent all the variants of a completion as an association, such that the retrieval of any one completion includes pointers to all of its variant completions.

I would treat all the completions as variants with incidence/frequency values. That is, they play the role of variant in a variant-of association.

Since we are talking about internal representation, I would not represent either the association type or the roles in the data structure. There isn’t any merging or interchange going on so optimize for speed of response.

Will still need to contend with completions that are members of different associations. That is, the completion stands for a different subject.

In any given variant-of association, all the variants represent the same subject.

Will have to give some thought to how to distinguish identical completions that are members of different associations.
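A rough sketch of the internal representation described above (names, structure and counts are mine, purely illustrative): completions grouped into a variant-of association keyed by the subject they stand for, with frequency values, so that retrieving any one completion yields pointers to all of its variants.

```python
from collections import defaultdict

# Key: the subject a completion stands for (here just a canonical string).
# Value: the variant-of association, i.e. each variant completion with its
# frequency as mined from search logs.
variant_of = defaultdict(dict)

def add_variant(subject_key, completion, count):
    variant_of[subject_key][completion] = (
        variant_of[subject_key].get(completion, 0) + count)

add_variant("color", "color", 9500)
add_variant("color", "colour", 2100)
add_variant("color", "collor", 40)

def variants_for(completion):
    """All variant-of associations this completion participates in.
    A completion appearing under more than one subject key is exactly
    the 'same spelling, different subject' case still to be worked out."""
    return {k: v for k, v in variant_of.items() if completion in v}

print(variants_for("colour"))
```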

(more to follow)

ScraperWiki

Filed under: Authoring Topic Maps,Data Mining,Data Source — Patrick Durusau @ 1:47 pm

ScraperWiki

The website describes traditional screen scraping and then says:

ScraperWiki is an online tool to make that process simpler and more collaborative. Anyone can write a screen scraper using the online editor, and the code and data are shared with the world. Because it’s a wiki, other programmers can contribute to and improve the code. And, if you’re not a programmer yourself, you can request a scraper or ask the ScraperWiki team to write one for you.

Interesting way to promote the transition to accessible and structured data.

One step closer to incorporation into or being viewed by a topic map!

Forbes: R is a name you need to know in 2011 – Post

Filed under: Marketing — Patrick Durusau @ 6:15 am

Forbes: R is a name you need to know in 2011

The Revolutions blog reports that Forbes has named R as a name you need to know in 2011.

I mention that story for two reasons:

First, it illustrates the importance that the business community is placing on data mining. Usefully combining the results of data mining (can you say “topic maps?”) seems like a natural next step.

Second, it sets a high water mark for the business community becoming aware of a technology. Something for us to collectively shoot for in the future.

R for topic maps anyone?

December 22, 2010

Speller Challenge

Filed under: Associations,Topic Maps — Patrick Durusau @ 7:57 pm

Speller Challenge

Microsoft Research and Bing are sponsoring a best speller contest!

From the website:

Important Dates:

  • January 17th 2011: Registration opens
  • May 27th 2011: Challenge ends at 11:59PM PDT
  • June 17th 2011: Winners Announced
  • July 1st 2011: Camera-ready workshop paper
  • July 2011: Workshop to present results

Goal of contest:

The goal of the Speller Challenge (the “Challenge”) is to build the best speller that proposes the most plausible spelling alternatives for each search query. Spellers are encouraged to take advantage of cloud computing and must be submitted to the Challenge in the form of REST-based Web Services. At the end of the challenge, the entry that you designate as your “primary entry” will be judged according to the evaluation measures described below to determine five (5) winners of the prizes described below.

Variant spellings seem like a natural application for topic maps.

Not by use of variant name, although that might work.

I was thinking more along the lines of associations.

I am curious how to model different sort orders for spellings for any single term.

Reasoning that presentation of spelling choices should vary depending on geographic location or similar data.
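One way to model that, sketched with invented data: scope each spelling’s frequency by locale and sort the suggestions by the frequency in the requesting user’s scope.

```python
# Variant spellings of one term, with per-locale frequencies (made up).
spellings = {
    "colour": {"en-GB": 0.93, "en-US": 0.05},
    "color":  {"en-GB": 0.07, "en-US": 0.95},
}

def ranked(locale):
    """Sort order for spelling suggestions in the given locale (scope)."""
    return sorted(spellings,
                  key=lambda s: spellings[s].get(locale, 0.0),
                  reverse=True)

print(ranked("en-GB"))   # ['colour', 'color']
print(ranked("en-US"))   # ['color', 'colour']
```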

HyperGraphDB – Data Management for Complex Systems

Filed under: Hypergraphs,NoSQL — Patrick Durusau @ 1:36 pm

HyperGraphDB – Data Management for Complex Systems Author: Borislav Iordanov

Presentation on the architecture of HyperGraphDB.

Slides and MP3 file are available at the presentation link.

Covers the architecture of HyperGraphDB in just under 20 minutes.

Good for an overview but I would suggest looking at the documentation, etc. for a more detailed view.

The documentation describes its topic map component in part as:

In HGTM, all topic maps constructs are represented as HGDB atoms. The Java classes implementing those atoms are in the package org.hypergraphdb.apps.tm. The API is an almost complete implementation of the 1.0 specification. Everything except merging is implemented. Merging wouldn’t be hard, but I haven’t found the need for it yet.

I will be following up with the HyperGraphDB project on how merging was understood.

Will report back on what comes of that discussion.

Reading Tea Leaves: How Humans Interpret Topic Models

Filed under: Latent Dirichlet Allocation (LDA),Topic Models (LDA) — Patrick Durusau @ 9:12 am

Reading Tea Leaves: How Humans Interpret Topic Models Authors: Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, David M. Blei

Abstract:

Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Practitioners typically assume that the latent space is semantically meaningful. It is used to check models, summarize the corpus, and guide exploration of its contents. However, whether the latent space is interpretable is in need of quantitative evaluation. In this paper, we present new quantitative methods for measuring semantic meaning in inferred topics. We back these measures with large-scale user studies, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood. Surprisingly, topic models which perform better on held-out likelihood may infer less semantically meaningful topics.

Read the article first but then see the LingPipe Blog review of the same.

December 21, 2010

Topic Models (warning – not what you may think)

Filed under: Latent Dirichlet Allocation (LDA),Topic Models — Patrick Durusau @ 5:06 pm

Topic Models Authors: David M. Blei, John D. Lafferty

The terms topic and topic model refer here to the characterization of texts by sets of highly probable words.

The sets of words are called topics.

A topic model is the technique applied to a set of texts to extract topics.

In this particular article, that technique is latent Dirichlet allocation (LDA).

Apologies for the misuse of the term topic, but I suspect if we looked closely, someone was using the term topic before ISO 13250.

Good introduction to topic models.
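If you want to see topics in this statistical sense fall out of a handful of documents, here is a minimal sketch assuming scikit-learn is installed (toy documents of my own, not from the article).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "wavelets transform signals into coefficients",
    "topic models assign words to latent topics",
    "spectral methods transform term signals for retrieval",
    "latent dirichlet allocation is a generative topic model",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# The highest-weighted words per component are the "topics" (LDA sense).
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[::-1][:5]]
    print(f"topic {i}: {top}")
```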

Dirichlet Processes: Tutorial and Practical Course

Filed under: Bayesian Models,Dirichlet Processes — Patrick Durusau @ 4:50 pm

Dirichlet Processes: Tutorial and Practical Course Author: Yee Whye Teh
Slides
Paper

Abstract:

The Bayesian approach allows for a coherent framework for dealing with uncertainty in machine learning. By integrating out parameters, Bayesian models do not suffer from overfitting, thus it is conceivable to consider models with infinite numbers of parameters, aka Bayesian nonparametric models. An example of such models is the Gaussian process, which is a distribution over functions used in regression and classification problems. Another example is the Dirichlet process, which is a distribution over distributions. Dirichlet processes are used in density estimation, clustering, and nonparametric relaxations of parametric models. It has been gaining popularity in both the statistics and machine learning communities, due to its computational tractability and modelling flexibility.

In the tutorial I shall introduce Dirichlet processes, and describe different representations of Dirichlet processes, including the Blackwell-MacQueen urn scheme, Chinese restaurant processes, and the stick-breaking construction. I shall also go through various extensions of Dirichlet processes, and applications in machine learning, natural language processing, machine vision, computational biology and beyond.

In the practical course I shall describe inference algorithms for Dirichlet processes based on Markov chain Monte Carlo sampling, and we shall implement a Dirichlet process mixture model, hopefully applying it to discovering clusters of NIPS papers and authors.

With the last two posts, that is almost 8 hours of video for streaming to your new phone or other personal device.

That should get you past even a Christmas Day sports marathon at your in-laws’ house (or your own should they be visiting).
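If you would rather poke at the stick-breaking construction mentioned in the abstract before watching, here is a minimal (truncated) sketch assuming NumPy; the concentration parameter and truncation level are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, n_sticks):
    """Truncated stick-breaking weights for a Dirichlet process:
    break off Beta(1, alpha) fractions of the remaining stick."""
    fractions = rng.beta(1.0, alpha, size=n_sticks)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - fractions)[:-1]))
    return fractions * remaining

weights = stick_breaking(alpha=2.0, n_sticks=10)
print(weights)
print(weights.sum())  # just under 1.0 because of the truncation
```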

Bayesian inference and Gaussian processes – In six (6) parts

Filed under: Bayesian Models,Gaussian Processes — Patrick Durusau @ 4:45 pm

Bayesian inference and Gaussian processes Author: Carl Edward Rasmussen

Quite useful as the presenter concludes with disambiguating terminology used differently in the field. Same terms used to mean different things, different terms to mean the same thing. Hmmm, that sounds really familiar. 😉

Start with this lecture before Dirichlet Processes: Tutorial and Practical Course

BTW, if this seems a bit AI-ish, consider it to be the reverse of supervised classification (person helps machine): that is, machine helps person, but the person should say when the answer is correct.

Graphical Models

Filed under: Bayesian Models,Dirichlet Processes,Graphical Models,Inference — Patrick Durusau @ 4:12 pm

Graphical Models Author: Zoubin Ghahramani

Abstract:

An introduction to directed and undirected probabilistic graphical models, including inference (belief propagation and the junction tree algorithm), parameter learning and structure learning, variational approximations, and approximate inference.

  • Introduction to graphical models: (directed, undirected and factor graphs; conditional independence; d-separation; plate notation)
  • Inference and propagation algorithms: (belief propagation; factor graph propagation; forward-backward and Kalman smoothing; the junction tree algorithm)
  • Learning parameters and structure: (maximum likelihood and Bayesian parameter learning for complete and incomplete data; EM; Dirichlet distributions; score-based structure learning; Bayesian structural EM; brief comments on causality and on learning undirected models)
  • Approximate Inference: (Laplace approximation; BIC; variational Bayesian EM; variational message passing; VB for model selection)
  • Bayesian information retrieval using sets of items: (Bayesian Sets; Applications)
  • Foundations of Bayesian inference: (Cox Theorem; Dutch Book Theorem; Asymptotic consensus and certainty; choosing priors; limitations)

Start with this lecture before Dirichlet Processes: Tutorial and Practical Course
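As a small taste of the inference topics in the outline above, here is a toy sketch (my invented numbers) of exact inference on a three-variable chain by eliminating variables in order, which is what belief propagation reduces to on a chain; it assumes NumPy.

```python
import numpy as np

# Chain x1 -> x2 -> x3, each variable binary. Toy (invented) distributions.
p_x1 = np.array([0.6, 0.4])                     # P(x1)
p_x2_given_x1 = np.array([[0.7, 0.3],           # rows: x1, cols: x2
                          [0.2, 0.8]])
p_x3_given_x2 = np.array([[0.9, 0.1],           # rows: x2, cols: x3
                          [0.5, 0.5]])

# Forward elimination (the forward messages on the chain).
p_x2 = p_x1 @ p_x2_given_x1                     # marginal P(x2)
p_x3 = p_x2 @ p_x3_given_x2                     # marginal P(x3)

print(p_x2, p_x3)                               # each sums to 1
```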

Language Pyramid and Multi-Scale Text Analysis

Filed under: Bag-of-Words (BOW),Language Pyramid (LaP) — Patrick Durusau @ 3:22 pm

Language Pyramid and Multi-Scale Text Analysis Authors: Shuang-Hong Yang, Hongyuan Zha Keywords: bag of word, language pyramid, multi-scale language models, multi-scale text analysis, multi-scale text kernel, text spatial contents modeling

Abstract:

The classical Bag-of-Word (BOW) model represents a document as a histogram of word occurrence, losing the spatial information that is invaluable for many text analysis tasks. In this paper, we present the Language Pyramid (LaP) model, which casts a document as a probabilistic distribution over the joint semantic-spatial space and motivates a multi-scale 2D local smoothing framework for nonparametric text coding. LaP efficiently encodes both semantic and spatial contents of a document into a pyramid of matrices that are smoothed both semantically and spatially at a sequence of resolutions, providing a convenient multi-scale imagic view for natural language understanding. The LaP representation can be used in text analysis in a variety of ways, among which we investigate two instantiations in the current paper: (1) multi-scale text kernels for document categorization, and (2) multi-scale language models for ad hoc text retrieval. Experimental results illustrate that: for classification, LaP outperforms BOW by (up to) 4% on moderate-length texts (RCV1 text benchmark) and 15% on short texts (Yahoo! queries); and for retrieval, LaP gains 12% MAP improvement over uni-gram language models on the OHSUMED data set.

The text that stands out for me reads:

More pessimistically, different concepts usually stands at different scales, making it impossible to capture all the right meanings with a single scale. For example, named entities usually range from unigram (e.g., “new”) to bigram (e.g., “New York”), to multigram (e.g., “New York Times”), and even to a whole long sequence (e.g., a song name “ Another Lonely Night In New York”).

OK, so if you put it that way, then BOW (bag-of-words) is clearly not the best idea.

But we already treat text at different scales.

All together now: markup!

I checked the documentation on the latest MarkLogic release and found:

By default, MarkLogic Server assumes that any XML element constructor acts as a phrase boundary.
(Administrator’s Guide, 4.2, page 224)

Can someone supply the behavior for some of the other XML indexing engines? Thanks!
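To illustrate what treating element constructors as phrase boundaries means in practice, here is a small sketch using Python’s standard ElementTree (my example markup, not MarkLogic behavior): text runs are collected per element, so “New York Times” stays one phrase while the surrounding prose breaks around it.

```python
import xml.etree.ElementTree as ET

doc = "<p>The <em>New York Times</em> reported it.</p>"
root = ET.fromstring(doc)

phrases = []

def collect(elem):
    # Text before any child elements.
    if elem.text and elem.text.strip():
        phrases.append(elem.text.strip())
    for child in elem:
        collect(child)
        # Text following the child's end tag, still inside this element.
        if child.tail and child.tail.strip():
            phrases.append(child.tail.strip())

collect(root)
print(phrases)   # ['The', 'New York Times', 'reported it.']
```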

December 20, 2010

When OCR Goes Bad: Google’s Ngram Viewer & The F-Word – Post

Filed under: Humor,Indexing,Search Engines — Patrick Durusau @ 7:24 pm

When OCR Goes Bad: Google’s Ngram Viewer & The F-Word

Courtesy of http://searchengineland.com‘s Danny Sullivan, a highly amusing post on Google’s Ngram viewer.

Danny’s post only covers changing spelling and character rendering but serves to illustrate that the broader the time period covered, the greater the care needed to have results that make any sense at all.

Quite the post for the holidays!

A Graph Processing Stack

Filed under: Graphs — Patrick Durusau @ 2:31 pm

A Graph Processing Stack Author: Marko Rodriguez

An introduction to a graph processing stack by one of the leaders in the area of graph processing.

Named Entity Mining from Click-Through Data Using Weakly Supervised Latent Dirichlet Allocation

Filed under: Latent Dirichlet Allocation (LDA),Named Entity Mining,NEM,WS-LDA — Patrick Durusau @ 6:19 am

Named Entity Mining from Click-Through Data Using Weakly Supervised Latent Dirichlet Allocation (video) Authors: Shuang-Hong Yang, Gu Xu, Hang Li slides KDD ’09 paper

Abstract:

This paper addresses Named Entity Mining (NEM), in which we mine knowledge about named entities such as movies, games, and books from a huge amount of data. NEM is potentially useful in many applications including web search, online advertisement, and recommender system. There are three challenges for the task: finding suitable data source, coping with the ambiguities of named entity classes, and incorporating necessary human supervision into the mining process. This paper proposes conducting NEM by using click-through data collected at a web search engine, employing a topic model that generates the click-through data, and learning the topic model by weak supervision from humans. Specifically, it characterizes each named entity by its associated queries and URLs in the click-through data. It uses the topic model to resolve ambiguities of named entity classes by representing the classes as topics. It employs a method, referred to as Weakly Supervised Latent Dirichlet Allocation (WS-LDA), to accurately learn the topic model with partially labeled named entities. Experiments on a large scale click-through data containing over 1.5 billion query-URL pairs show that the proposed approach can conduct very accurate NEM and significantly outperforms the baseline.

With some slight modifications, almost directly applicable to the construction of topic maps.

Questions:

  1. What presumptions underlie the use of supervision to assist with Named Entity Mining? (2-3 pages, no citations)
  2. Are those valid presumptions for click-through data? (2-3 pages, no citations)
  3. How would you suggest investigating the characteristics of click-through data? (2-3 pages, no citations)

The Sensitivity of Latent Dirichlet Allocation for Information Retrieval

The Sensitivity of Latent Dirichlet Allocation for Information Retrieval Author: Laurence A. F. Park, The University of Melbourne. slides

Abstract:

It has been shown that the use of topic models for Information retrieval provides an increase in precision when used in the appropriate form. Latent Dirichlet Allocation (LDA) is a generative topic model that allows us to model documents using a Dirichlet prior. Using this topic model, we are able to obtain a fitted Dirichlet parameter that provides the maximum likelihood for the document set. In this article, we examine the sensitivity of LDA with respect to the Dirichlet parameter when used for Information retrieval. We compare the topic model computation times, storage requirements and retrieval precision of fitted LDA to LDA with a uniform Dirichlet prior. The results show that there is no significant benefit of using fitted LDA over the LDA with a constant Dirichlet parameter, hence showing that LDA is insensitive with respect to the Dirichlet parameter when used for Information retrieval.

Note that topic is used in semantic analysis (of various kinds) to mean highly probable words and not in the technical sense of the TMDM or XTM.

Extraction of highly probable words from documents can be useful in the construction of topic maps for those documents.

Building blocks of a scalable webcrawler

Filed under: Indexing,Search Engines,Webcrawler — Patrick Durusau @ 4:41 am

Building blocks of a scalable webcrawler

From Marc Seeger’s post about his thesis:

This thesis documents my experiences trying to handle over 100 million sets of data while keeping them searchable. All of that happens while collecting and analyzing about 100 new domains per second. It covers topics from the different Ruby VMs (JRuby, Rubinius, YARV, MRI) to different storage-backend (Riak, Cassandra, MongoDB, Redis, CouchDB, Tokyo Cabinet, MySQL, Postgres, …) and the data-structures that they use in the background.

Questions:

  1. What components would need to be added to make this a semantic crawling project? (3-5 pages, citations)
  2. What scalability issues would semantic crawling introduce? (3-5 pages, citations)
  3. Design a configurable, scalable, semantic crawler. (Project)

December 19, 2010

ASPECT Vocabulary Bank for Education API

Filed under: Data Source,Education — Patrick Durusau @ 2:04 pm

ASPECT Vocabulary Bank for Education API

From the http://www.programmableweb.com website:

The ASPECT Vocabulary Bank for Education (VBE) provides both a browsable and searchable web application for users to locate, view and download terminology, as well as standards-based machine to machine interfaces. The VBE provides a range of multilingual, controlled lists relevant to learning in the EU, including those that are used to validate metadata profiles and a thesaurus used to describe educational topics. The RESTful API allows users to interface with the VBE.

The EU is a group that realizes not all users speak the same language.

OWL, Ontologies, Formats, Punch Cards, Oh My!

Filed under: Linked Data,OWL,Semantic Web,Topic Maps — Patrick Durusau @ 2:03 pm

Edwin Black’s IBM and the Holocaust reports that one aspect of the use of IBM punch card technology by the Nazis (and others) was the monopoly that IBM maintained on the manufacture of the punch cards.

The IBM machines could only use IBM punch cards.

The IBM machines could only use IBM punch cards.

The repetition was intentional. Think about that statement in a more modern context.

When we talk about Linked Data, or OWL, or Cyc, or SUMO, etc. (yes, I am aware that I am mixing formats and ontologies), isn’t that the same thing?

They are not physical monopolies like IBM punch cards but rather are intellectual monopolies.

Say it this way (insert your favorite format/ontology) or you don’t get to play.

I am sure that meets the needs of software designed to work with particular formats or ontologies.

But that isn’t the same thing as representing user semantics.

Note: Representing user semantics.

Not semantics as seen by the W3C or SUMO or Cyc or (insert your favorite group) or even XTM Topic Maps.

All of those quite usefully represent some user semantics.

None of them represent all user semantics.

No, I am not going to argue there is a non-monopoly solution.

To successfully integrate (or even represent) data, choices have to be made and those will result in a monopoly.

My caution is not to mistake the lip of the teacup that is your monopoly for the horizon of the world.

Very different things.

*****
PS: Economic analysis of monopolies could be useful when discussing intellectual monopolies. The “products” are freely available but the practices have other characteristics of monopolies. (I have added a couple of antitrust books to my Amazon.com wish list should anyone feel moved to contribute.)

December 18, 2010

Mergent Historical Securities API

Filed under: Data Source — Patrick Durusau @ 4:50 pm

Mergent Historical Securities API

From http://www.programmableweb.com:

Provides historical quotes for all US and Canadian stocks, indices, mutual funds, OTC bulletin board issues, and other securities. Prices are fully adjusted for splits, dividends and other corporate actions. Data on both living and dead issues is available. The API also provides access to non-price data such as short interest, shares outstanding, earnings per share and more.

This API, along with one of the news archives, could make an interesting exercise for high school or even college students.

What news events affect stock prices?

Do events of the same nature have the same impact?

KNIME Version 2.3.0 released – News

Filed under: Heterogeneous Data,Mapping,Software,Subject Identity — Patrick Durusau @ 12:48 pm

KNIME Version 2.3.0 released

From the announcement:

The new version is greatly enhancing the usability of KNIME. It adds new features like workflow annotations, support for hotkeys, inclusion of R-views in reports, data flow switches, option to hide node labels, variable support in the database reader/connector and R-nodes, and the ability to export KNIME workflows as SVG Graphics.

With the 2.3 release we are also introducing a community node repository, which includes KNIME extensions for bio- and chemoinformatics and an advanced R-scripting environment.

Data trails reconstruction at the community level in the Web of data – Presentation

Filed under: Co-Words,Data Mining,Subject Identity — Patrick Durusau @ 9:30 am

David Chavalarias: Video from SOKS: Self-Organising Knowledge Systems, Amsterdam, 29 April 2010.

Abstract:

Socio-semantic networks continuously produce data over the Web in a time consistent manner. From scientific communities publishing new findings in archives to citizens confronting their opinions in blogs, there is a real challenge to reconstruct, at the community level, the data trails they produce in order to have a global representation of the topics unfolding in these public arena. We will present such methods of reconstruction in the framework of co-word analysis, highlighting perspectives for the development of innovative tools for our daily interactions with their productions.

I wasn’t able to get very good sound quality for this presentation and there were no slides. However, I was interested enough to find the author’s home page: David Chavalarias and a wealth of interesting material.

I will be watching his projects for some very interesting results and suggest that you do the same.

Fuzzy Logic – Tutorial

Filed under: Fuzzy Logic — Patrick Durusau @ 8:03 am

Fuzzy Logic Author: Michael Berthold, Department of Computer and Information Science, University of Konstanz

Description:

The tutorial will introduce the basics of fuzzy logic for data analysis. Fuzzy Logic can be used to model and deal with imprecise information, such as inexact measurements or available expert knowledge in the form of verbal descriptions. We will first introduce the concepts of fuzzy sets, degrees of membership and fuzzy set operators. After discussions on fuzzy numbers and arithmetic operations using them, the focus will shift to fuzzy rules and how such systems of rules can be derived from available data.

Subject identity isn’t always crisp. Fuzzy logic offers a way to deal with those situations.

Self-organization in Distributed Semantic Repositories – Presentation

Filed under: Clustering,Self-organization,Similarity — Patrick Durusau @ 6:19 am

Kia Teymourian Video, Slides from SOKS: Self-Organising Knowledge Systems, Amsterdam, 29 April 2010

Abstract:

Principles from nature-inspired selforganization can help to attack the massive scalability challenges in future internet infrastructures. We researched into ant-like mechanisms for clustering semantic information. We outline algorithms to store related information within clusters to facilitate efficient and scalable retrieval.

At the core are similarity measures that cannot consider global information such as a completely shared ontology. Mechanisms for syntax-based URI-similarity and the usage of a dynamic partial view on an ontology for path-length based similarity are described and evaluated. We give an outlook on how to consider application specific relations for clustering with a usecase in geo-information systems.

Research questions:

  1. What about a similarity function where “sim = 1.0?”
  2. What about ants with different similarity functions?
  3. The similarity measure is RDF bound. What other similarity measures are in use?

Observation: The WordNet ontology is used for the evaluation. It occurred to me that WordNet gets used a lot, but never reused. Or rather, the results of using WordNet are never reused.

Isn’t it odd that we keep reasoning about sparrows being like ducks, over and over again? Seems like we should be able to take the results of others and build upon them. What prevents that from happening? Either in searching or ontology systems.
