Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 4, 2010

A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise

Filed under: Authoring Topic Maps,Clustering,Data Mining — Patrick Durusau @ 11:26 am

A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise (1996) Authors: Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu Keywords: Clustering Algorithms, Arbitrary Shape of Clusters, Efficiency on Large Spatial Databases, Handling Noise.

Before you decide to skip this paper as “old,” consider that it has more than 600 citations in CiteSeer.

Abstract:

Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.

Discovery of classes is always an issue in topic map authoring/design and clustering is one way to find classes, perhaps even ones you did not suspect existed.
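For a feel of the density-based idea, here is a minimal pure-Python sketch of DBSCAN (my own simplification, not the authors' implementation; `eps` and `min_pts` correspond to the paper's Eps and MinPts parameters):

```python
from math import dist  # Python 3.8+

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = {}  # point index -> cluster id
    cluster = -1

    def neighbors(i):
        # Every point within eps of points[i], including i itself.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise (a later cluster may still claim it as a border point)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:  # grow the cluster through density-reachable points
            j = queue.pop()
            if labels.get(j, -1) == -1:  # unvisited, or previously marked noise
                labels[j] = cluster
                reach = neighbors(j)
                if len(reach) >= min_pts:  # j is a core point: keep expanding
                    queue.extend(k for k in reach if k not in labels)
    return [labels[i] for i in range(len(points))]
```

Two dense groups and one isolated point, e.g. `dbscan([(0, 0), (0.5, 0), (1, 0), (10, 10), (10.5, 10), (11, 10), (50, 50)], eps=1.0, min_pts=2)`, come back labeled as clusters 0 and 1 with the outlier marked -1. The paper's algorithm answers the neighborhood queries with an R*-tree; this brute-force version is O(n²).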

Indiana University – Bioinformatics

Filed under: Bioinformatics,Biomedical,Information Retrieval — Patrick Durusau @ 10:37 am

Indiana University – Bioinformatics

The Research & Projects Page offers a sampling of the work underway.

Subject Identification Patterns

Filed under: Authoring Topic Maps,Subject Identifiers,Subject Identity,Subject Locators — Patrick Durusau @ 10:27 am

Does that sound like a good book title?

Since everyone is recycling old material under the patterns rubric, topic maps may as well jump on the bandwagon.

Instead of the three amigos (was that a movie?) we could have the dirty dozen honchos (or was that another movie?). I don’t get out much these days so I would probably need some help with current cultural references.

This ties into Lars Heuer’s effort to distinguish between Playboy Playmates and Astronauts, while trying to figure out why birds keep, well, let’s just say he has to wash his hair a lot.

When you have an entry from DBpedia, what do you have to know to identify it? Its URI is one thing but I rarely encounter URIs while shopping. (Or playmates for that matter.)

Is 303 Really Necessary? – Blog Post

Filed under: Linked Data,RDF,Semantic Web,Uncategorized — Patrick Durusau @ 9:46 am

Is 303 Really Necessary?

Ian Davis details at length why 303s are unnecessary and offers an interesting alternative.

Read the comments as well.
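For context, the pattern under discussion is the httpRange-14 compromise: a URI that names a non-information resource (a person, a city) must not answer 200 directly, so the server answers 303 See Other, pointing at a document about the thing. A minimal sketch, with hypothetical URIs:

```python
# Sketch of the 303 redirect dance the post questions (hypothetical URIs).
# A URI naming a thing-in-the-world redirects to a document about it;
# only document URIs answer 200 directly.
DOC_ABOUT = {
    "http://example.org/id/alice": "http://example.org/doc/alice",
}

def respond(uri):
    """Return (status, location-or-body-uri) for a GET on uri."""
    if uri in DOC_ABOUT:
        # uri names a person, not a document: 303 See Other to a description.
        return 303, DOC_ABOUT[uri]
    return 200, uri  # an ordinary information resource

# Davis's argument, roughly: the extra round-trip buys little, and a 200
# response with self-describing content could serve the same purpose.
```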

November 3, 2010

Aperture: a Java framework for getting data and metadata

Filed under: Data Mining,Software — Patrick Durusau @ 7:07 pm

Aperture: a Java framework for getting data and metadata

From the website:

Aperture is an open source library for crawling and indexing information sources such as file systems, websites and mail boxes. Aperture supports a number of common source types and document formats out-of-the-box and provides easy ways to extend it with custom implementations.

Aperture wiki

Example applications include:

  • bibsonomycrawler.bat – crawls Bibsonomy accounts, extracts bookmarks and tags
  • deliciouscrawler.bat – crawls delicious accounts, extracts bookmarks and tags
  • filecrawler.bat – crawls filesystems, extracts the folder structure, the file metadata and the file content
  • flickrcrawler.bat – crawls flickr accounts, extracts tags, and photos metadata
  • icalcrawler.bat – crawls calendars stored in the well-known iCalendar format, extracts events, todos, journal entries etc.
  • imapcrawler.bat – crawls remote mailboxes accessible with IMAP
  • mboxcrawler.bat – crawls local mailboxes stored in mbox-format files (e.g. those from thunderbird)
  • outlookcrawler.bat – connects to the running Outlook instance and crawls appointments, contacts and emails; note that this crawler only works on Windows with MS Outlook installed
  • thunderbirdcrawler.bat – crawls a thunderbird addressbook and extracts contacts; note that for crawling emails you should use the mboxcrawler
  • webcrawler.bat – crawls websites

More tools for your topic map toolbox!

Managing Semantic Ambiguity

Filed under: Legends,TMDM,Topic Maps — Patrick Durusau @ 6:52 pm

Topic maps do not and cannot eliminate semantic ambiguity. What topic maps can do is assist users in managing semantic ambiguity with regard to identification of particular subjects.

Consider the well-known ambiguity of whether a URI is an identifier or an address.

The Topic Maps Data Model (TMDM) provides a way to manage that ambiguity by providing a means to declare whether a URI is being used as an identifier or as an address.

That is only “managing” the ambiguity because nothing prevents incorrect use of the mechanism, which would reintroduce ambiguity, or even leave the mechanism misunderstood entirely.
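A sketch of how that declaration works, as a hypothetical Python data structure (not a TMDM API, and the merging rule is simplified):

```python
from dataclasses import dataclass

@dataclass
class Topic:
    # Subject identifiers: URIs that *identify* the subject; the subject is
    # whatever the URI is documented to denote.
    identifiers: frozenset = frozenset()
    # Subject locators: URIs that *address* the subject; the subject is the
    # information resource retrieved from the URI.
    locators: frozenset = frozenset()

def may_merge(a, b):
    # Simplified TMDM-style rule: topics merge only when they share a URI
    # used in the SAME role, so a URI used once as an identifier and once
    # as an address does not collapse two distinct subjects into one.
    return bool(a.identifiers & b.identifiers or a.locators & b.locators)

# Same URI, two declared uses -- two different subjects:
about_the_page = Topic(identifiers=frozenset({"http://example.org/page"}))
the_page_itself = Topic(locators=frozenset({"http://example.org/page"}))
```

Here `may_merge(about_the_page, the_page_itself)` is false: the topic whose subject is *described by* the page and the topic whose subject *is* the page stay distinct.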

Identifying a subject by saying its representative (proxy) must have properties X…Xn is itself a collection of possible ambiguities that an author hopes a reader will understand.

Since we are trying to communicate with other people, there isn’t any escape from semantic ambiguity. Ever.

Topic maps provide the ability to offer more complete descriptions of subjects in hopes of being understood by others.

With the ability to add descriptions of the same subject from others, they can offer users a variety of descriptions of a single subject.

We have had episodic forays into “certainty,” the Semantic Web being only one of the more recent failures in that direction. Ambiguity anyone?

The Semantic Web Garden of Eden

Filed under: Marketing,RDF,Semantic Web,Topic Maps — Patrick Durusau @ 6:48 pm

The Garden of Eden:

[2:19] And out of the ground the LORD God formed every beast of the field, and every fowl of the air; and brought them unto Adam to see what he would call them: and whatsoever Adam called every living creature, that was the name thereof….[1]

As the number of Adams and Eves multiplied, so did the names of things.

Multiple names for the same things, different things with the same names.

Ambiguity had entered the world.

The Semantic Web Garden of Eden sought to banish ambiguity:

…by an RDF statement having…URIrefs are used to identify not only the subject of the original statement, but also the predicate and object, instead of using the words “creator” and “John Smith” [2]
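The Primer's idea, sketched in plain Python to show the shape of a statement (URIs follow the Primer's example):

```python
# Every part of the statement -- subject, predicate, object -- is a URI
# reference instead of a bare word like "creator" or "John Smith".
DC = "http://purl.org/dc/elements/1.1/"
EX = "http://www.example.org/"

statement = (
    EX + "index.html",      # subject: the web page being described
    DC + "creator",         # predicate: a URI, not the word "creator"
    EX + "staffid/85740",   # object: a URI standing in for John Smith
)

def same_predicate(s1, s2):
    # Independent statements can be recognized as using the same property
    # by comparing URIs rather than words.
    return s1[1] == s2[1]
```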

As the number of URIs multiplied, so did the URIs of things.

Multiple URIs for the same things, different things with the same URIs.

Ambiguity remains in the world.

******
[1] Genesis 2:19
[2] RDF Primer, 2.2 RDF Model, http://www.w3.org/TR/rdf-primer/

Weaknesses In Linked Data

Filed under: Linked Data,RDF,Semantic Web — Patrick Durusau @ 6:47 pm

A Partnership between Structured Data and Ontotext to address weaknesses in linked data framed it this way:

Volumes of linked data on the Web are growing. This growth is exposing three key weaknesses:

  1. inadequate semantics for how to link disparate information together that recognizes inherently different contexts and viewpoints and (often) approximate mappings
  2. misapplication of many linking predicates, such as owl:sameAs, and
  3. a lack of coherent reference concepts by which to aggregate and organize this linkable content.
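Weakness 2 can be made concrete: owl:sameAs asserts strict identity, so every property of one resource is entailed of the other. A small sketch of that substitution rule (hypothetical URIs, one inference pass only):

```python
triples = {
    ("ex:Paris_city", "ex:population", "2100000"),
    ("ex:Paris_page", "ex:format", "text/html"),
    # Misapplied link: a city and a web page about the city are not identical.
    ("ex:Paris_city", "owl:sameAs", "ex:Paris_page"),
}

def same_as_entailment(triples):
    """One pass of the owl:sameAs substitution rule."""
    out = set(triples)
    pairs = {(s, o) for s, p, o in out if p == "owl:sameAs"}
    pairs |= {(o, s) for s, o in pairs}  # sameAs is symmetric
    for a, b in pairs:
        out |= {(b, p, o) for s, p, o in set(out) if s == a}
        out |= {(s, p, b) for s, p, o in set(out) if o == a}
    return out

inferred = same_as_entailment(triples)
# The web page now has a population and the city has a MIME type --
# exactly the kind of nonsense the misapplied predicate licenses.
```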

The amount of linked data is trivial compared to the total volume of digital data.

Makes me wonder about the “only the web will scale” argument.

Questions:

  1. How do these three “key weaknesses” compare to current barriers to semantic integration? (3-5 pages, no citations)
  2. “inadequate semantics?” What’s wrong with the semantics we have now? Or is the point that formal semantics are inadequate? (discussion)
  3. “coherent reference concepts?” How would you recognize one if you saw it? (3-5 pages, no citations)

November 2, 2010

Healthcare Terminologies and Classification: Essential Keys to Interoperability

Filed under: Biomedical,Health care,Medical Informatics — Patrick Durusau @ 6:53 am

Healthcare Terminologies and Classification: Essential Keys to Interoperability published by the American Medical Informatics Association and the American Health Information Management Association is a bit dated (2007) but is still a good overview of the area.

Questions:

  1. What are the major initiatives on interoperability of healthcare terminologies today?
  2. What are the primary resources (web/print) for one of those initiatives?
  3. Prepare a one page abstract for each of five articles on one of these initiatives.

A Prototype of Multimedia Metadata Management System for Supporting the Integration of Heterogeneous Sources

Filed under: Heterogeneous Data,MPEG-7 — Patrick Durusau @ 6:29 am

A Prototype of Multimedia Metadata Management System for Supporting the Integration of Heterogeneous Sources Authors: Tie Hua Zhou, Byeong Mun Heo, Ling Wang, Yang Koo Lee, Duck Jin Chai and Keun Ho Ryu Keywords: Multimedia Metadata Management Systems, Metadata, MPEG-7, TV-Anytime

Abstract:

With the advances in information technology, the amount of multimedia metadata captured, produced, and stored is increasing rapidly. As a consequence, multimedia content is widely used for many applications in today’s world, and hence, a need for organizing multimedia metadata and accessing it from repositories with vast amount of information has been a driving stimulus both commercially and academically. MPEG-7 is expected to provide standardized description schemes for concise and unambiguous content description of data/documents of complex multimedia types. Meanwhile, other metadata or description schemes, such as Dublin Core, XML, TV-Anytime etc., are becoming popular in different application domains. In this paper, we present a new prototype Multimedia Metadata Management System. Our system is good at supporting the integration of multimedia metadata from heterogeneous sources. This system enables the collection, analysis and integration of multimedia metadata semantic description from several different kinds of services (UCC, IPTV, VOD, Digital TV, etc.).

The details for the “Metadata Analyzer” and “Metadata Mapping” seem to be a bit sparse (as in non-existent) for a “prototype…supporting integration of heterogeneous sources.”

MPEG-7 has an important role to play in this area and topic mappers should be aware of it.

I will try to locate more useful resources on MPEG-7 and multimedia content.

Advances in Clustering Search

Filed under: Search Engines,Searching — Patrick Durusau @ 6:04 am

Advances in Clustering Search Authors: Tarcisio Souza Costa, Alexandre César Muniz Oliveira, Luiz Antonio Nogueira Lorena Keywords: Clustering Search, search subspaces, combinatorial optimisation, population metaheuristics, evolutionary algorithms.

Abstract:

The Clustering Search (*CS) has been proposed as a generic way of combining search metaheuristics with clustering to detect promising search areas before applying local search procedures. The clustering process may keep representative solutions associated to different search subspaces. Although recent applications have reached success in combinatorial optimisation problems, nothing new has arisen concerning diversification issues when population metaheuristics, such as evolutionary algorithms, are being employed. In this work, recent advances in the *CS are commented and new features are proposed, including the possibility of keeping the population diversified for more generations.

The use of different search strategies for clusters of interest was the most interesting aspect of this article.

Not that big of a step to imagine different subject or association recognition routines depending upon the type of cluster.

Afghanistan War Diary – Improvements

Filed under: Authoring Topic Maps,Maiana,Topic Map Software,Topic Maps — Patrick Durusau @ 5:25 am

It was only days after the release of the Afghanistan War Diary that Aki Kivela posted improvements to it using automatic extractors.

Important not only as a demonstration of participation in a topic maps project but also the incremental nature of topic map authoring.

Afghanistan War Diary

Filed under: Authoring Topic Maps,Data Source,Maiana,Topic Maps — Patrick Durusau @ 5:15 am

Afghanistan War Diary.

A portion of the Afghanistan war documents published by Wikileaks as a topic map.

The release is an automatic conversion to a topic map so does not reflect the nuances that human authoring brings to a topic map.

QuaaxTM – PHP Topic Maps – New Release

Filed under: Authoring Topic Maps,Topic Map Software — Patrick Durusau @ 5:00 am

QuaaxTM – PHP Topic Maps has a new release!

Added support for XTM 2.1 read/write.

November 1, 2010

Questions: n-n1 pages, no citations

Filed under: Class Admin — Patrick Durusau @ 4:50 pm

Questions: n-n1 pages, no citations

A quick word about my entries that read:

Questions: n-n1 pages, no citations

No citations: Spend your time developing your analysis, not looking up someone else’s.

You may miss some points, but in developing your own, you may find points others have overlooked.

Comparing your analysis to that of others, after developing your own, will sharpen your analytical skills.

Writing up your analysis will improve your writing/communication skills.

Caveat: This will be some of the hardest writing of your academic career.

You cannot hide behind citations, arguments from authority, quotations, excessive use of adjectives, etc.

On the other hand, I think you will surprise yourself at how clearly you can express your own analysis with a little practice.

Rule Markup Initiative

Filed under: RDF,RuleML,Semantic Web — Patrick Durusau @ 4:48 pm

Rule Markup Initiative

From the website:

The RuleML Initiative is an international non-profit organization covering all aspects of Web rules and their interoperation, with a Structure and Technical Groups that center on RuleML specification, tool, and application development. Around RuleML, an open network of individuals and groups from both industry and academia has emerged, having a shared interest in modern rule topics, including the interoperation of Semantic Web rules. The RuleML Initiative has been collaborating with OASIS on Legal XML, Policy RuleML, and related efforts since 2004. The Initiative has further been interacting with the developers of ISO Common Logic (CL), which became an International Standard, First edition, in October 2007. RuleML is also a member of OMG, contributing to its Semantics of Business Vocabulary and Business Rules (SBVR), which went into Version 1.0 in January 2008, and to its Production Rule Representation (PRR), which went into Version 1.0 in December 2009. Moreover, participants of the RuleML Initiative have supported the development of the W3C Rule Interchange Format (RIF), which attained Recommendation status in June 2010. The annual RuleML Symposium has taken the lead in bringing together delegates from industry and academia who share this interest focus in Web rules.

Questions:

  1. Does the use of ISO Common Logic ensure interoperability? Why/Why not? (discussion)
  2. How would you define interoperability? (3-5 pages, no citations)
  3. Can rules ensure your definition of interoperability? (discussion)
  4. Are rules subject to the same semantic drift as data? Why/Why not? (3-5 pages, no citations)

Introduction to Graphical Models for Data Mining

Filed under: Data Mining,Graphs,Machine Learning — Patrick Durusau @ 4:32 pm

Introduction to Graphical Models for Data Mining by Arindam Banerjee, Department of Computer Science and Engineering, University of Minnesota.

Abstract:

Graphical models for large scale data mining constitute an exciting development in statistical data analysis which has gained significant momentum in the past decade. Unlike traditional statistical models which often make `i.i.d.’ assumptions, graphical models acknowledge dependencies among variables of interest and investigate inference/prediction while taking into account such dependencies. In recent years, latent variable Bayesian networks, such as latent Dirichlet allocation, stochastic block models, Bayesian co-clustering, and probabilistic matrix factorization techniques have achieved unprecedented success in a variety of application domains including topic modeling and text mining, recommendation systems, multi-relational data analysis, etc. The tutorial will give a broad overview of graphical models, and discuss recent developments in the context of mixed-membership models, matrix analysis models, and their generalizations. The tutorial will present a balanced mix of models, inference/learning methods, and applications.

Slides (pdf)
Slides (ppt)

If you plan on using data mining as a source for authoring topic maps, graphical models are on your reading list.

Questions:

  1. Would you use the results of a Bayesian network to author an entry in a topic map? Why/why not? (2-3 pages, no citations)
  2. Would you use the results of a Bayesian network to author an entry in a library catalog? Why/why not? (2-3 pages, no citations)
  3. Do we attribute certainty to library catalog entries that are actually possible entries for a particular item? (discussion question)
  4. What are examples of the use of Bayesian networks in classification for library catalogs? (discussion)

American Medical Informatics Association

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 4:31 pm

American Medical Informatics Association

From the website:

AMIA is dedicated to promoting the effective organization, analysis, management, and use of information in health care in support of patient care, public health, teaching, research, administration, and related policy. AMIA’s 4,000 members advance the use of health information and communications technology in clinical care and clinical research, personal health management, public health/population, and translational science with the ultimate objective of improving health.

For over thirty years the members of AMIA and its honorific college, the American College of Medical Informatics (ACMI), have sponsored meetings, education, policy and research programs. The federal government frequently calls upon AMIA as a source of informed, unbiased opinions on policy issues relating to the national health information infrastructure, uses and protection of personal health information, and public health considerations, among others.

Learning the terminology and concerns of an area is the first step towards successful development/application of topic maps.

Questions:

  1. Review the latest four issues of the Journal of the American Medical Informatics Association. (JAMIA)
  2. Select one article with issues that could be addressed by use of a topic map.
  3. How would you use a topic map to address those issues? (3-5 pages, no citations other than the article in question)
  4. Select one article with issues that would be difficult or cannot be addressed using a topic map.
  5. Why would a topic map be difficult to use or cannot address the issues in the article? (3-5 pages, no citations other than the article in question)

Medical Informatics – Formal Training

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 4:30 pm

Medical Informatics – Formal Training

A listing of formal training opportunities in medical informatics.

Understanding the current state of medical informatics is the starting point for offering topic map based services in health or medical areas.

