Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

November 18, 2010

The Positive Matching Index: A new similarity measure with optimal characteristics

Filed under: Binary Distance,Similarity,Subject Identity — Patrick Durusau @ 7:57 am

The Positive Matching Index: A new similarity measure with optimal characteristics Authors: Daniel Andrés Dos Santos, Reena Deutsch Keywords: Binary data, Association coefficient, Jaccard index, Dice index, Similarity

Abstract:

Despite the many coefficients accounting for the resemblance between pairs of objects based on presence/absence data, no one measure shows optimal characteristics. In this work the Positive Matching Index (PMI) is proposed as a new measure of similarity between lists of attributes. PMI fulfills Tulloss’ theoretical prerequisites for similarity coefficients, is easy to calculate and has an intrinsic meaning expressible in natural language. PMI is bounded between 0 and 1 and represents the mean proportion of positive matches relative to the size of attribute lists, ranging this cardinality continuously from the smaller list to the larger one. PMI behaves correctly where alternative indices either fail, or only approximate to the desirable properties for a similarity index. Empirical examples associated with biomedical research are provided to show the outperformance of PMI in relation to standard indices such as Jaccard and Dice coefficients.

An index for people who don’t think a single measure of identity (URIs) is enough – say, those in the natural sciences?
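Reading the abstract literally, PMI is the mean of a/s as the list size s runs continuously from the smaller attribute list to the larger one. A minimal sketch of that reading (my reconstruction from the abstract, not a verified transcription of the paper’s formula), next to Jaccard and Dice:

    import math

    def jaccard(a, b, c):
        # a = shared attributes, b and c = attributes unique to each list
        return a / (a + b + c)

    def dice(a, b, c):
        return 2 * a / (2 * a + b + c)

    def pmi(a, b, c):
        # Continuous mean of a/s for s from a+min(b,c) to a+max(b,c);
        # reduces to the plain proportion when the lists are equal-sized.
        lo, hi = a + min(b, c), a + max(b, c)
        if lo == hi:
            return a / lo if lo else 1.0
        return a * math.log(hi / lo) / (hi - lo)

    # Two lists sharing 8 attributes, with 2 and 12 unique attributes each:
    print(jaccard(8, 2, 12), dice(8, 2, 12), pmi(8, 2, 12))
    # 0.36..., 0.53..., 0.55... - PMI is not dragged down by the size mismatch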

November 17, 2010

RDA: Resource Description and Access

Filed under: Cataloging,Classification,RDA,Subject Identity,Topic Maps — Patrick Durusau @ 11:06 am

RDA: Resource Description and Access

From the website:

RDA: Resource Description and Access is the new standard for resource description and access designed for the digital world. Built on the foundations established by AACR2, RDA provides a comprehensive set of guidelines and instructions on resource description and access covering all types of content and media. (emphasis in original)

In case you are interested, the 2008 draft version, just to get the flavor of it, is available at: http://www.rdatoolkit.org/constituencyreview.

More to follow on RDA and topic maps.

Hard-Coding Bias in Google “Algorithmic” Search Results

Filed under: Access Points,Subject Headings,Subject Identity,Topic Maps — Patrick Durusau @ 7:31 am

Hard-Coding Bias in Google “Algorithmic” Search Results.

Not that I want to get into an analysis of whether search results are hard-coded or not, but it is an interesting lead-in to issues a bit closer to home.

To what extent does subject identification have built-in biases that impact user communities?

Or less abstractly, how would we go about discovering and perhaps countering such bias?

For countering the bias you can guess that I would suggest topic maps. 😉

The more pressing question, and one that is relevant to topic map design, is how to discover our own biases?

What seems perfectly natural to me, with a background in law, biblical studies, networking technologies, markup technologies, and now semantic technologies, may not seem so to other users.

To make matters worse, how do you ask a user about information they did not find?

Questions:

  1. How would you survey users to discover biases in subject identification? (3-5 pages, no citations)
  2. How would you discover what information users did not find? (3-5 pages, no citations)
  3. Class project: Design and test a survey for bias in a particular subject identification. (assuming permission from a library)

PS: There are biases in algorithms as well but we will cover those separately.

November 16, 2010

Reducing Ambiguity, LOD, Ookaboo, TMRM

Filed under: Ambiguity,Subject Identity,TMRM,Topic Maps — Patrick Durusau @ 9:25 pm

While reading Resource Identity and Semantic Extensions: Making Sense of Ambiguity and In Defense of Ambiguity it occurred to me that reducing ambiguity has a hidden assumption.

That hidden assumption is the intended audience for whom I wish to reduce ambiguity.

For example, Ookaboo solves the problem of multiple vocabularies for its intended audience thusly:

Our strategy for dealing with multiple subject terminologies is to use what we call a reference set, which in this case is

http://ookaboo.com/o/pictures/topic/2021903/Central_Air_Force_Museum#it
http://dbpedia.org/resource/Central_Air_Force_Museum
http://rdf.freebase.com/ns/m.0g_2bv

If we want to assert foaf:depicts we assert foaf:depicts against all of these. The idea is that not all clients are going to have the inferencing capabilities that I wish they’d have, so I’m trying to assert terms in the most “core” databases of the LOD cloud.

In a case like this we may have YAGO, OpenCyc, UMBEL and other terms available. Relationships like this are expressed as

<:Whatever> <ontology2:aka>
<http://mpii.de/yago/resource/Central_Air_Force_Museum> .

<ontology2:aka>, not dereferencable yet, means (roughly) that “some people use term X to refer to substantially the same thing as term Y.” It’s my own answer to the <owl:sameAs> problem and deliberately leaves the exact semantics to the reader. (It’s a lossy expression of the data structures that I use for entity management)
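A sketch of the “assert against every member of the reference set” strategy in Python with rdflib (the picture URI is hypothetical; the reference set is the one quoted above):

    from rdflib import Graph, Namespace, URIRef

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")
    g = Graph()

    picture = URIRef("http://example.com/picture/123")  # hypothetical
    reference_set = [
        URIRef("http://ookaboo.com/o/pictures/topic/2021903/Central_Air_Force_Museum#it"),
        URIRef("http://dbpedia.org/resource/Central_Air_Force_Museum"),
        URIRef("http://rdf.freebase.com/ns/m.0g_2bv"),
    ]

    # foaf:depicts is asserted once per member of the reference set, so a
    # client with weak inferencing still finds at least one term it knows.
    for term in reference_set:
        g.add((picture, FOAF.depicts, term))

    print(g.serialize(format="turtle"))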

This is very like a TMRM solution since it gathers different identifications together, in hopes that at least one will be understood by a reader.

This is very unlike a TMRM solution because it has no legend to say how to compare these “values,” much less their “key.”

The lack of a legend makes integration in legal, technical, medical or intelligence applications, ah, difficult.

Still, it is encouraging to see the better Linked Data applications moving in the direction of the TMRM.

In Defense of Ambiguity

Filed under: OWL,RDF,Semantic Web,Subject Identity — Patrick Durusau @ 5:49 pm

In Defense of Ambiguity, by Patrick J. Hayes and Harry Halpin, was cited in David Booth’s article, so, like any academic, I had to go read the cited paper. 😉

Highly recommended.

The authors conclude:

Regardless of the details, the use of any technology in Web architecture to distinguish between access and reference, including our proposed ex:refersTo and ex:describedBy, does nothing more than allow the author of a URI to explain how they would like the URI to be used. Ultimately, there is nothing that Web architecture can do to prevent a URI from being used to refer to some thing non-accessible. However, at least having a clear and coherent device, such as a few RDF predicates, would allow the distinction to be made so the author could give guidance on what they believe best practice for their URI would be. This would vastly improve the situation from where it is today, where this distinction is impossible. The philosophical case for the distinction between reference and access is clear. The main advantage of Web architecture is that there is now a de facto universal identification scheme for accessing networked resources. With the Semantic Web, we can now extend this scheme to the wide world outside the Web by use of reference. By keeping the distinction between reference and access clear, the lemons of ambiguity can be turned into lemonade. Reference is inherently ambiguous, and ambiguity is not an error of communication, but fundamental to the success of communication both on and off the Web.

Sounds like the distinction between subject locators and identifiers that topic maps made long before this paper was written.

Resource Identity and Semantic Extensions: Making Sense of Ambiguity

Filed under: OWL,RDF,Semantic Web,Subject Identity — Patrick Durusau @ 5:29 pm

Resource Identity and Semantic Extensions: Making Sense of Ambiguity. David Booth’s paper was cited by Bernard Vatant, so I had to go take a look.

Bernard says: “The best analysis of the issue I’ve read so far.” I have to agree.

From the paper’s conclusion:

In general, a URI’s resource identity will necessarily be ambiguous. But this is not the end of the world. Rather, it means that while it may be unambiguous enough for one application, another application may require finer distinctions and thus consider it ambiguous. However, this ambiguity of resource identity can be precisely constrained by the use of URI declarations. Finally, a standard process is proposed for determining a URI’s resource identity.

Ambiguity is part and parcel of any system; the real question is how much of it you can tolerate.

For some systems that is quite a bit; for others (air traffic control comes to mind), as little as possible.

Other identifiers are ambiguous as well.

Successful integration of data across systems depends on how well we deal with that ambiguity.

November 4, 2010

Subject Identification Patterns

Filed under: Authoring Topic Maps,Subject Identifiers,Subject Identity,Subject Locators — Patrick Durusau @ 10:27 am

Does that sound like a good book title?

Thinking that since everyone is recycling old stuff under the patterns rubric, topic maps may as well jump on the bandwagon.

Instead of the three amigos (was that a movie?) we could have the dirty dozen honchos (or was that another movie?). I don’t get out much these days so I would probably need some help with current cultural references.

This ties into Lars Heuer’s effort to distinguish between Playboy Playmates and Astronauts, while trying to figure out why birds keep, well, let’s just say he has to wash his hair a lot.

When you have an entry from DBpedia, what do you have to know to identify it? Its URI is one thing but I rarely encounter URIs while shopping. (Or playmates for that matter.)

October 30, 2010

Sense and Reference on the Web

Filed under: Semantic Web,Semantics,Subject Identity — Patrick Durusau @ 10:01 am

Sense and Reference on the Web is Harry Halpin’s thesis seeking to answer the question: “What does a Uniform Resource Identifier (URI) mean?”

Abstract:

This thesis builds a foundation for the philosophy of the Web by examining the crucial question: What does a Uniform Resource Identifier (URI) mean? Does it have a sense, and can it refer to things? A philosophical and historical introduction to the Web explains the primary purpose of the Web as a universal information space for naming and accessing information via URIs. A terminology, based on distinctions in philosophy, is employed to define precisely what is meant by information, language, representation, and reference. These terms are then employed to create a foundational ontology and principles of Web architecture. From this perspective, the Semantic Web is then viewed as the application of the principles of Web architecture to knowledge representation. However, the classical philosophical problems of sense and reference that have been the source of debate within the philosophy of language return. Three main positions are inspected: the logicist position, as exemplified by the descriptivist theory of reference and the first-generation Semantic Web, the direct reference position, as exemplified by Putnam and Kripke’s causal theory of reference and the second-generation Linked Data initiative, and a Wittgensteinian position that views the Semantic Web as yet another public language. After identifying the public language position as the most promising, a solution of using people’s everyday use of search engines as relevance feedback is proposed as a Wittgensteinian way to determine the sense of URIs. This solution is then evaluated on a sample of the Semantic Web discovered via queries from a hypertext search engine query log. The results are evaluated and the technique of using relevance feedback from hypertext Web searches to determine relevant Semantic Web URIs in response to user queries is shown to considerably improve baseline performance. Future work for the Web that follows from our argument and experiments is detailed, and outlines of a future philosophy of the Web are laid out.

Questions:

  1. Choose a non-Web reference system.
  2. What is the nature of those references? (3-5 pages, with citations)
  3. Compare those references to URIs.
  4. How are those references and URIs the same/different? (3-5 pages, with citations)
  5. Evaluate Halpin’s use of Wittgenstein. (5-10 pages, with citations)

October 24, 2010

Recognizing Synonyms

Filed under: Marketing,Subject Identity,Synonymy — Patrick Durusau @ 11:04 am

I saw a synonym that I recognized the other day and started wondering how I recognized it.

The word I had in mind was “student” and the synonym was “pupil.”

Attempts to recognize synonyms:

  • spelling: student, pupil – No.
  • length: student 7 letters, pupil 5 letters – No.
  • origin: student – late 14c., from O.Fr. estudient; pupil – from O.Fr. pupille (14c.) – No. [1]
  • numerology: student (a = 1, b = 2 …) student = 19 + 20 + 21 + 4 + 5 + 14 + 20 = 103 ; pupil = 16 + 21 + 16 + 9 + 12 = 74 – No [2].

But I know “student” and “pupil” to be synonyms.[3]
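For the record, here are those surface checks in code, along with the kind of resource an automated system might consult instead – WordNet via nltk (a sketch, assuming the wordnet corpus is installed); a shared synset is simply somebody else’s documented declaration of synonymy:

    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

    w1, w2 = "student", "pupil"

    print(w1 == w2)                         # spelling: False
    print(len(w1), len(w2))                 # length: 7 vs 5
    print(sum(ord(ch) - 96 for ch in w1),   # "numerology": 103 vs 74
          sum(ord(ch) - 96 for ch in w2))

    # What actually works: membership in a shared, curated synset.
    print(set(wn.synsets(w1)) & set(wn.synsets(w2)))  # non-empty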

I could just declare them to be synonyms.

But then how do I answer questions like:

  • Why did I think “student” and “pupil” were synonyms?
  • What would make some other term a synonym of either “student” or “pupil?”
  • How can an automated system match my finding of more synonyms?

Provisional thoughts on answers to follow this week.

Questions:

Without reviewing my answers in this series, pick a pair of synonyms and answer those three questions for that pair. (There are answers other than mine.)

*****

[1] Synonym origins from: Online Etymology Dictionary

[2] There may be some Bible code type operation that can discover synonyms but I am unaware of it.

[3] They are synonyms now, that wasn’t always the case.

October 23, 2010

SLiMSearch: A Webserver for Finding Novel Occurrences of Short Linear Motifs in Proteins, Incorporating Sequence Context

Filed under: Bioinformatics,Biomedical,Pattern Recognition,Subject Identity — Patrick Durusau @ 5:56 am

SLiMSearch: A Webserver for Finding Novel Occurrences of Short Linear Motifs in Proteins, Incorporating Sequence Context Authors: Norman E. Davey, Niall J. Haslam, Denis C. Shields, Richard J. Edwards Keywords: short linear motif, motif discovery, minimotif, elm

Short, linear motifs (SLiMs) play a critical role in many biological processes. The SLiMSearch (Short, Linear Motif Search) webserver is a flexible tool that enables researchers to identify novel occurrences of pre-defined SLiMs in sets of proteins. Numerous masking options give the user great control over the contextual information to be included in the analyses, including evolutionary filtering and protein structural disorder. User-friendly output and visualizations of motif context allow the user to quickly gain insight into the validity of a putatively functional motif occurrence. Users can search motifs against the human proteome, or submit their own datasets of UniProt proteins, in which case motif support within the dataset is statistically assessed for over- and under-representation, accounting for evolutionary relationships between input proteins. SLiMSearch is freely available as open source Python modules and all webserver results are available for download. The SLiMSearch server is available at: http://bioware.ucd.ie/slimsearch.html .

Software: http://bioware.ucd.ie/slimsearch.html

Seemed like an appropriate resource to follow on today’s earlier posting.

Note in the keywords, “elm.”

Care to guess what that means? If you are a bioinformatics or biology person you may get it correct.

What do you think the odds are that any person much less a general search engine will get it correct?

Topic maps are about making sure you find the Eukaryotic Linear Motif Resource without wading through everything a search of any common search engine returns for “elm.”
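SLiMs are conventionally written as short regular-expression-like patterns, so the bare bones of “find occurrences of a pre-defined motif” look something like this toy sketch (motif and sequence are made up; SLiMSearch adds all the masking, context, and statistics on top):

    import re

    motif = re.compile(r"P.{2}P")       # hypothetical SLiM: P, any two, P
    sequence = "MKTPAYPQRSLLPIIPGG"     # made-up protein sequence

    for m in motif.finditer(sequence):
        print(m.start(), m.group())     # 3 PAYP / 12 PIIP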

Questions:

  1. What other terms in this paper represent other subjects?
  2. What properties would you use to identify those subjects?
  3. How would you communicate those subjects to someone else?

An Algorithm to Find All Identical Motifs in Multiple Biological Sequences

Filed under: Bioinformatics,Biomedical,Pattern Recognition,Subject Identity — Patrick Durusau @ 5:25 am

An Algorithm to Find All Identical Motifs in Multiple Biological Sequences Authors: Ashish Kishor Bindal, R. Sabarinathan, J. Sridhar, D. Sherlin, K. Sekar Keywords: Sequence motifs, nucleotide and protein sequences, identical motifs, dynamic programming, direct repeat and phylogenetic relationships

Sequence motifs are of greater biological importance in nucleotide and protein sequences. The conserved occurrence of identical motifs represents the functional significance and helps to classify the biological sequences. In this paper, a new algorithm is proposed to find all identical motifs in multiple nucleotide or protein sequences. The proposed algorithm uses the concept of dynamic programming. The application of this algorithm includes the identification of (a) conserved identical sequence motifs and (b) identical or direct repeat sequence motifs across multiple biological sequences (nucleotide or protein sequences). Further, the proposed algorithm facilitates the analysis of comparative internal sequence repeats for the evolutionary studies which helps to derive the phylogenetic relationships from the distribution of repeats.

Good illustration that subject identification, here sequence motifs in nucleotide and protein sequences, varies by domain.

Subject matching in this type of data on the basis of assigned URL identifiers for sequence motifs would be silly.

But that’s the question, isn’t it? What is the appropriate basis for subject matching in a particular domain?
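Not the paper’s dynamic programming, but a brute-force sketch of the underlying task – substrings identical across every input sequence – makes the point concrete: here the motif is identified by its content and its conservation, no URL in sight:

    def common_motifs(seqs, min_len=4):
        # All substrings of length >= min_len present in every sequence.
        # Brute force for illustration; the paper's DP scales far better.
        base = min(seqs, key=len)
        return {base[i:j]
                for i in range(len(base))
                for j in range(i + min_len, len(base) + 1)
                if all(base[i:j] in s for s in seqs)}

    seqs = ["GATTACAGATT", "TTGATTACAA", "CGATTACAG"]  # made-up sequences
    print(common_motifs(seqs))  # 'GATTACA' and its sub-motifs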

Questions:

  1. Identify and describe one (1) domain where URL matching for subjects would be unnecessary overhead. (3 pages, no citations)
  2. Identify and describe one (1) domain where URL matching for subjects would be useful. (3 pages, no citations)
  3. What are the advantages of URLs as a lingua franca? (3 pages, no citations)
  4. What are the disadvantages of URLs as a lingua franca? (3 pages, no citations)

***
BTW, when you see “no citations” that does not mean you should not be reading the relevant literature. What it means is that I want your analysis of the issues and not your channeling of the latest literature.

October 22, 2010

Rethinking Library Linking: Breathing New Life into OpenURL

Filed under: Cataloging,Indexing,OpenURL,Subject Identity,Topic Maps — Patrick Durusau @ 7:26 am

Rethinking Library Linking: Breathing New Life into OpenURL Authors: Cindi Trainor and Jason Price

Abstract:

OpenURL was devised to solve the “appropriate copy problem.” As online content proliferated, it became possible for libraries to obtain the same content from multiple locales: directly from publishers and subscription agents; indirectly through licensing citation databases that contain full text; and, increasingly, from free online sources. Before the advent of OpenURL, the only way to know whether a journal was held by the library was to search multiple resources. An OpenURL link resolver accepts links from library citation databases (sources) and returns to the user a menu of choices (targets) that may include links to full text, the library catalog, and other related services (figure 1). Key to understanding OpenURL is the concept of “context sensitive” linking: links to the same item will be different for users of different libraries, and are dependent on the library’s collections. This issue of Library Technology Reports provides practicing librarians with real-world examples and strategies for improving resolver usability and functionality in their own institutions.
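To get the flavor of what a source hands a resolver, here is a sketch of a KEV-encoded OpenURL (the resolver base URL and citation values are hypothetical; key names follow the Z39.88-2004 journal format as I understand it):

    from urllib.parse import urlencode

    resolver = "https://resolver.example.edu/openurl"  # hypothetical
    citation = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.jtitle": "Library Technology Reports",
        "rft.volume": "44",
        "rft.issue": "3",
        "rft.spage": "5",
    }
    # "Context sensitive" linking: the same citation resolves differently
    # at different libraries because only the resolver base URL changes.
    print(resolver + "?" + urlencode(citation))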

Resources:

OpenURL (ANSI/NISO Z39.88-2004)

openURL@oclc.org archives

Questions:

  1. OCLC says of OpenURL:

    Remember the card catalog? Everything in a library was represented in the card catalog with one or more cards carrying bibliographic information. OpenURL is the internet equivalent of those index cards.

  2. True? (3-5 pages, no citations), or
  3. False? (3-5 pages, no citations)

October 21, 2010

mloss.org – machine learning open source software

mloss.org – machine learning open source software

Open source repository of machine learning software.

Not only are subjects being recognized by these software packages but their processes and choices are subjects as well. Not to mention their description in the literature.

Fruitful grounds for adaptation to topic maps as well as being the subject of topic maps.

There are literally hundreds of software packages here so I welcome suggestions, comments, etc. on any and all of them.

Questions:

  1. Examples of vocabulary mis-match in machine learning literature?
  2. Using one sample data set, how would you integrate results from different packages? Assume you are not merging classifiers.
  3. What if the classifiers are unknown? That is, all you have are the final results. Is your result different? Reliable?
  4. Describe a (singular) merging of classifiers in subject identity terms.

October 20, 2010

Integrating Biological Data – Not A URL In Sight!

Actual title: Kernel methods for integrating biological data by Dick de Ridder, The Delft Bioinformatics Lab, Delft University of Technology.

Biological data integration to improve protein expression – read: hugely profitable industrial processes based on biology.

Need to integrate biological data, including “prior knowledge.”

In case kernel methods aren’t your “thing,” one important point:

There are vast seas of economically important data unsullied by URLs.

Kernel methods are one method to integrate some of that data.
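The core trick, stripped to a toy numpy sketch: each data source yields its own kernel (similarity matrix) over the same proteins, and a weighted sum of valid kernels is again a valid kernel, so downstream learners see one integrated similarity (the data and weights below are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    expression = rng.normal(size=(5, 8))     # view 1: expression profiles
    seq_feats = rng.normal(size=(5, 20))     # view 2: sequence features

    K_expr = expression @ expression.T       # linear kernel per view
    K_seq = seq_feats @ seq_feats.T

    # A convex combination of PSD kernels is still PSD: one integrated kernel.
    K = 0.7 * K_expr + 0.3 * K_seq
    print(K.shape)  # (5, 5)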

Questions:

  1. How to integrate kernel methods into topic maps? (research project)
  2. Subjects in a kernel method? (research paper, limit to one method)
  3. Modeling specific uses of kernels in topic maps. (research project)
  4. Edges of kernels? Are there subject limits to kernels? (research project)

October 14, 2010

linloglayout

Filed under: Clustering,Graphs,Subject Identity — Patrick Durusau @ 10:45 am

linloglayout

Overview:

LinLogLayout is a simple program for computing graph layouts (positions of graph nodes in two- or three-dimensional space) and graph clusterings. It reads a graph from a file, computes a layout and a clustering, writes the layout and the clustering to a file, and displays them in a dialog. LinLogLayout can be used to identify groups of densely connected nodes in graphs, like communities of friends or collaborators in social networks, related documents in hyperlink structures (e.g. web graphs), cohesive subsystems in software systems, etc. With a change of a parameter in the main method, it can also compute classical “nice” (i.e. readable) force-directed layouts.

Finding “densely connected nodes” is one step towards finding subjects.

Subject finding tool kits will include a variety of such techniques.
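LinLogLayout itself is Java; purely as a sketch of the “densely connected nodes” idea, networkx’s community detection does the analogous job in Python (toy graph below):

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Two small cliques joined by a single weak bridge
    G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"),
                  ("x", "y"), ("x", "z"), ("y", "z"),
                  ("c", "x")])

    # Each densely connected group is a candidate subject (cluster)
    for community in greedy_modularity_communities(G):
        print(sorted(community))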

October 13, 2010

Semantic Drift: What Are Linked Data/RDF and TMDM Topic Maps Missing?

Filed under: Linked Data,RDF,Subject Identifiers,Subject Identity,Topic Maps — Patrick Durusau @ 9:38 am

One RDF approach to semantic drift is to situate a vocabulary among other terms.

TMDM topic maps enable a user to gather up information that they consider identifies the subject in question.

Additional information helps to identify a particular subject. (RDF/TMDM approaches)

Isn’t that the opposite of semantic drift?

What’s happening in both cases?

The RDF approach is guessing that it has the sense of the word as used by the author (if the right word at all).

Kleb reports approximately 48% precision.

So in 1 out of 2 emergency room situations we get the right term? (Not to knock Kleb’s work. It is an important approach that needs further development.)

Topic maps are guessing as well.

We don’t know what information in a subject identifier identifies a subject. Some of it? All of it? Under what circumstances?

Question: What information identifies a subject, at least to its author?

Answer: Ask the Author.

Asking authors what information identifies their subject(s) seems like an overlooked approach.

Domain-specific vocabularies that distinguish the information that identifies a subject from merely supplemental information about it would be a good start.

That avoids inline syntax difficulties and enables authors to easily and quickly associate subject identification information with their documents.

Both RDF and TMDM Topic Maps could use the same vocabularies to improve their handling of associated document content.
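No such vocabulary exists yet, so purely as a sketch of the shape one entry might take (every name here is hypothetical):

    # Hypothetical vocabulary entry: identifying vs. supplemental information
    entry = {
        "term": "pupil",
        "identifying": {
            "domain": "education",   # the scope within which this identifies
            "definition": "a person being taught, typically in a school",
        },
        "supplemental": {
            "etymology": "from O.Fr. pupille (14c.)",
            "related": ["student", "learner"],
        },
    }

    def same_subject(e1, e2):
        # Match on identifying information only; supplemental data is ignored.
        return e1["identifying"] == e2["identifying"]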

October 12, 2010

Semantic Drift: A Topic Map Answer (sort-of)

Filed under: Subject Identifiers,Subject Identity,TMDM,Topic Maps,XTM — Patrick Durusau @ 6:37 am

Topic maps took a different approach to the problem of identifying subjects (than RDF) and so look at semantic drift differently.

In the original 13250, subject descriptor was defined as:

3.19 subject descriptor – Information which is intended to provide a positive, unambiguous indication of the identity of a subject, and which is the referent of an identity attribute of a topic link.

When 13250 was reformulated to focus on the XTM syntax and the legend known as the Topic Maps Data Model (TMDM), the subject descriptor of old became the subject identifier. (Clause 7, TMDM)

A subject identifier has information that identifies a subject.

The author of a topic uses information that identifies a subject to create a subject identifier. (Which is represented in a topic map by an IRI.)

Anyone can look at the subject identifier to see if they are talking about the same subject.

They are responsible for catching semantic drift if it occurs.
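A minimal sketch of the merging rule this supports (per the TMDM, topics sharing at least one subject identifier represent the same subject; the PSIs below are hypothetical):

    class Topic:
        def __init__(self, *subject_identifiers):
            self.subject_identifiers = set(subject_identifiers)

    def should_merge(t1, t2):
        # One equal subject identifier is enough to trigger merging.
        return bool(t1.subject_identifiers & t2.subject_identifiers)

    t1 = Topic("http://example.org/psi/pupil")
    t2 = Topic("http://example.org/psi/pupil",
               "http://example.org/psi/student")
    print(should_merge(t1, t2))  # True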

But, there is something missing from RDF and topic maps.

Something that would help with semantic drift, although they would use it differently.

Care to take a guess?

October 11, 2010

Semantic Drift: An RDF Answer (sort-of)

Filed under: RDF,Semantic Web,Subject Identity — Patrick Durusau @ 7:27 am

As promised last week, there are RDF researchers working on issues related to semantic drift.

An interesting approach can be found in: Entity Reference Resolution via Spreading Activation on RDF-Graphs Author(s): Joachim Kleb, Andreas Abecker

Abstract:

The use of natural language identifiers as reference for ontology elements—in addition to the URIs required by the Semantic Web standards—is of utmost importance because of their predominance in the human everyday life, i.e. speech or print media. Depending on the context, different names can be chosen for one and the same element, and the same element can be referenced by different names. Here homonymy and synonymy are the main cause of ambiguity in perceiving which concrete unique ontology element ought to be referenced by a specific natural language identifier describing an entity. We propose a novel method to resolve entity references under the aspect of ambiguity which explores only formal background knowledge represented in RDF graph structures. The key idea of our domain independent approach is to build an entity network with the most likely referenced ontology elements by constructing Steiner graphs based on spreading activation. In addition to exploiting complex graph structures, we devise a new ranking technique that characterises the likelihood of entities in this network, i.e. interpretation contexts. Experiments in a highly polysemic domain show the ability of the algorithm to retrieve the correct ontology elements in almost all cases.

It is the situating of a concept in a context (not assignment of a URI) that enables the correct result in a polysemic domain.

This doesn’t directly model semantic drift but does represent anchoring a term in a particular context.
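Not Kleb and Abecker’s algorithm, but a toy spreading-activation pass over an adjacency list shows the mechanics of letting context pick the reading (the graph and seeds are made up):

    def spread(graph, seeds, decay=0.5, steps=2):
        # Each step, every node passes a decayed, evenly split share of
        # its activation to its neighbours.
        activation = dict.fromkeys(graph, 0.0)
        activation.update(seeds)
        for _ in range(steps):
            nxt = dict(activation)
            for node, energy in activation.items():
                for nb in graph[node]:
                    nxt[nb] += decay * energy / len(graph[node])
            activation = nxt
        return activation

    graph = {  # tiny made-up graph: "Paris" is ambiguous
        "Paris": ["France", "Paris_Hilton"],
        "France": ["Paris"],
        "Paris_Hilton": ["Paris"],
    }
    # Seeding the context term "France" leaves the city reading's
    # neighbourhood far more activated than the person reading's.
    print(spread(graph, {"Paris": 1.0, "France": 1.0}))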

The questions that divide semantic technologies are:

  • Who throws the anchor?
  • Who governs the anchors?
  • Can there be more than one anchor?
  • What about “my” anchor?
  • …and others

More on those anon.

Finding Itemset-Sharing Patterns in a Large Itemset-Associated Graph

Filed under: Data Mining,Graphs,Similarity,Subject Identity — Patrick Durusau @ 6:37 am

Finding Itemset-Sharing Patterns in a Large Itemset-Associated Graph Authors: Mutsumi Fukuzaki, Mio Seki, Hisashi Kashima, Jun Sese

Abstract:

Itemset mining and graph mining have attracted considerable attention in the field of data mining, since they have many important applications in various areas such as biology, marketing, and social network analysis. However, most existing studies focus only on either itemset mining or graph mining, and only a few studies have addressed a combination of both. In this paper, we introduce a new problem which we call itemset-sharing subgraph (ISS) set enumeration, where the task is to find sets of subgraphs with common itemsets in a large graph in which each vertex has an associated itemset. The problem has various interesting potential applications such as in side-effect analysis in drug discovery and the analysis of the influence of word-of-mouth communication in marketing in social networks. We propose an efficient algorithm ROBIN for finding ISS sets in such a graph; this algorithm enumerates connected subgraphs having common itemsets and finds their combinations. Experiments using a synthetic network verify that our method can efficiently process networks with more than one million edges. Experiments using a real biological network show that our algorithm can find biologically interesting patterns. We also apply ROBIN to a citation network and find successful collaborative research works.

If you think of a set of properties, “itemset,” as a topic and an “itemset-sharing subgraph (ISS)” as a match/merging criteria, the relevance of this paper to topic maps becomes immediately obvious.

Useful both for discovery of topics in data sets and as part of processing a topic map.
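A drastically simplified sketch of the ISS notion (nothing like ROBIN itself): fix a target itemset, keep the vertices whose itemsets contain it, and report the connected subgraphs that remain:

    import networkx as nx

    def iss_for(graph, itemsets, target):
        # Connected subgraphs whose vertices all share the target itemset
        keep = [v for v in graph if target <= itemsets[v]]
        return list(nx.connected_components(graph.subgraph(keep)))

    G = nx.path_graph(5)  # vertices 0-4 in a line
    itemsets = {0: {"a", "b"}, 1: {"a"}, 2: {"c"}, 3: {"a", "c"}, 4: {"a"}}
    print(iss_for(G, itemsets, {"a"}))  # [{0, 1}, {3, 4}]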

October 8, 2010

Semantic Drift and Linked Data/Semantic Web

Filed under: Linked Data,OWL,Semantic Web,Subject Identity — Patrick Durusau @ 10:28 am

Overloading OWL sameAs starts with:

Description: General Issue: owl:sameAs is being used in the linked data community in a way that is inconsistent with its semantics.

Read the document but in summary: People use OWL sameAs to mean different things.

I don’t see how their usage can be “inconsistent with its semantics.”

Words don’t possess self-executing semantics that bind us. Rather the other way round I think.

If OWL sameAs had some “original” semantic, it changed by the process of semantic drift.

Semantic drift is where the semantics of a token changes over time or across communities due to its use by people.

URIs or tokens may be “stable,” but the evidence is that the semantics of URIs or tokens are not.

The question is how to manage changing, emerging, drifting semantics? (Not a question answered by a static semantic model of URI based identity.)

PS: RDF researchers have recognized semantic drift and have proposed solutions for addressing it. More on that anon.

Questions:

  1. Select a classification more than 30 years old and randomly select one book for each 5 year period for the last 30 years. What (if any) semantic drift do you see in the use of this classification?
  2. Exchange your list with a classmate. Do you agree/disagree with their evaluation? Why?
  3. Repeat the exercise in #1 and #2 but use a classification where you can find books from between 30 and 60 years ago. Select one book per 5 year period.

A Haptic-Based Framework for Chemistry Education: Experiencing Molecular Interactions with Touch

A Haptic-Based Framework for Chemistry Education: Experiencing Molecular Interactions with Touch Author(s): Sara Comai, Davide Mazza Keywords: Haptic technology – Chemical education and teaching – Molecular interaction

Abstract:

The science of haptics has received great attention in the last decade for data visualization and training. In particular, haptics can be introduced as a novel technology for educational purposes. The usage of haptic technologies can greatly help to make the students feel sensations not directly experienceable and typically only reported as notions, sometimes also counter-intuitively, in textbooks. In this work, we present a haptically-enhanced system for the tactile exploration of molecules. After a brief description of the architecture of the developed system, the paper describes how it has been introduced in the usual didactic activity by providing a support for the comprehension of concepts typically explained only theoretically. User feedback and impressions are reported as results of this innovation in teaching.

Imagine researchers using haptics to recognize molecules or molecular reactions.

Are the instances of recognition to be compared with other such instances?

How would you establish the boundaries for a “match?”

How would you communicate those boundaries to others?

October 7, 2010

Public Interchangeable Identifier

Filed under: Subject Identifiers,Subject Identity,Topic Maps — Patrick Durusau @ 7:19 am

I mentioned yesterday that creating a public interchangeable identifier isn’t as easy as identifying identifiers and documenting them publicly. See: Recognizing an Interchangeable Identifier.

What if I identified (by some means) “Patrick” as an identifier and posted it to my website (public documentation).

Is that now a “public interchangeable identifier?”

No. Why?

First, there has to be some agreed upon means to declare an identifier to be an identifier. When I say agreed upon, it need not be something as formal as a standard but it has to be recognized by a community of users.

Second, it is important to know in what context this is an identifier. Akin to what we talk about as “scope” in topic maps. But with the recognition that the notion of “unconstrained” scope is a pernicious fiction. Scope may be unspecified but it is never unconstrained.

I would argue that no identifier exists without some defined scope. It may not be known or specified but the essence of an identifier, that it identifies some subject, exists only within some scope.

More on means to declare identifiers and their context anon.

October 6, 2010

Recognizing an Interchangeable Identifier

Filed under: Indexing,Semantics,Subject Identifiers,Subject Identity — Patrick Durusau @ 7:13 am

Subjects & Identifiers shows why we need interchangeable identifiers.

Q: How would you recognize an interchangeable identifier?

A: Oh, yeah, that’s right. Anything we can talk about has an identifier, so how to recognize an interchangeable identifier?

If two people agree on column headers for a database table, they have interchangeable identifiers for the columns, at least between the two of them.

There are two requirements for interchangeable identifiers:

  1. Identification as an identifier.
  2. Notice of the identifier.

Any token can be an identifier under some circumstances so identifiers must be identified for interchange.

Notice of an identifier is usually a matter of being part of a profession or discipline. Some term is an identifier because it was taught to you as one.

That works for local interchange, but public interchange requires publicly documented identifiers.

That’s it. Identify identifiers and document the identifiers publicly and you will have public interchangeable identifiers.

It can’t be that simple? Well, truthfully, it’s not.

More on public interchangeable identifiers forthcoming.

October 3, 2010

Subjects & Identifiers

Filed under: Subject Identity,Topic Maps — Patrick Durusau @ 7:27 am

For all the talk about assigning subjects identifiers, all subjects already have identifiers.

The ones we can talk about anyway.* Try it, you will see what I mean. As soon as you say a name or otherwise identify a subject, it has an identifier.

In the classic topic map use case, mapping indexes together, all the subjects had identifiers, the words the indexers had used. But the indexers had used the same words for different subjects and different words for the same subjects.

The search for universal identifiers is a known dead end, so what is the next best solution?

Interchangeable Identifiers.

Interchangeable identifiers provide more information to assist in matching up different identifiers for the same subjects. And distinguishing different subjects.

The development of “interchange” markup for texts and data started over twenty (20) years ago and continues today.

The sooner we start exploring interchangeable identifiers the sooner we will make up for lost time.

*(I don’t worry about subjects I can’t talk about.)

September 30, 2010

Assessing the scenic route: measuring the value of search trails in web logs

Filed under: Authoring Topic Maps,Searching,Subject Identity,Topic Maps — Patrick Durusau @ 10:34 am

Assessing the scenic route: measuring the value of search trails in web logs Authors: Ryen W. White, Jeff Huang Keywords: log analysis, search trails, trail following

Abstract:

Search trails mined from browser or toolbar logs comprise queries and the post-query pages that users visit. Implicit endorsements from many trails can be useful for search result ranking, where the presence of a page on a trail increases its query relevance. Following a search trail requires user effort, yet little is known about the benefit that users obtain from this activity versus, say, sticking with the clicked search result or jumping directly to the destination page at the end of the trail. In this paper, we present a log-based study estimating the user value of trail following. We compare the relevance, topic coverage, topic diversity, novelty, and utility of full trails over that provided by sub-trails, trail origins (landing pages), and trail destinations (pages where trails end). Our findings demonstrate significant value to users in following trails, especially for certain query types. The findings have implications for the design of search systems, *including trail recommendation systems that display trails on search result pages.* (emphasis added)

If your topic map client has search logs for internal resources, don’t neglect those as part of your topic map construction process, both for identification of important subjects and for navigation links between subjects.

This was the best paper for SIGIR 2010.

Plagiarism and Subject Identity

Filed under: Marketing,Subject Identity,Topic Maps — Patrick Durusau @ 9:32 am

Plagiarism detection is a form of detecting subject-sameness.

If you think of a document as a subject and say 95% of it is the same as another document, you could conclude that it is the same subject. (Or set your own level of duplication for subject-sameness.)
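One common way to put a number on “the same,” sketched with word shingles and the Jaccard index (the shingle size and threshold here are arbitrary choices, not a standard):

    def shingles(text, k=5):
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    def same_subject(doc1, doc2, k=5, threshold=0.95):
        # Jaccard similarity over word k-grams, against a chosen cutoff
        s1, s2 = shingles(doc1, k), shingles(doc2, k)
        union = s1 | s2
        return bool(union) and len(s1 & s2) / len(union) >= threshold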

One of the early use cases for topic maps was avoiding the duplication of documentation (and billing for the same) for defense systems.

Detecting self-plagiarism from a law firm, vendor, contractor, or consultant is one thing.

Putting those incidents together across a government agency, business, institution, or enterprise is a job for topic maps.

Entity Resolution – Journal of Data and Information Quality

Filed under: Entity Resolution,Heterogeneous Data,Subject Identity — Patrick Durusau @ 5:37 am

Special Issue on Entity Resolution.

The Journal of Data and Information Quality is a new journal from the ACM.

Calls for papers should not require ACM accounts for viewing.

I have re-ordered (to put the important stuff first) and reproduced the call below:

Important Dates

  • Submissions due: December 15, 2010
  • Acceptance Notification: April 30, 2011
  • Final Paper Due: June 30, 2011
  • Target Date for Special Issue: September 2011

Topics of interest include, but are not limited to:

  • ER impacts on Information Quality and impacts of Information Quality
    on ER
  • ER frameworks and architectures
  • ER outcome/performance assessment and metrics
  • ER in special application domains and contexts
  • ER and high-performance computing (HPC)
  • ER education
  • ER case studies
  • Theoretical frameworks for ER and entity-based integration
  • Method and techniques for
    • Entity reference extraction
    • Entity reference resolution
    • Entity identity management and identity resolution
    • Entity relationship analysis

Entity resolution (ER) is a key process for improving data quality in data integration in modern information systems. ER covers a wide range of approaches to entity-based integration, known variously as merge/purge, record de-duplication, heterogeneous join, identity resolution, and customer recognition. More broadly, ER also includes a number of important pre- and post-integration activities, such as entity reference extraction and entity relationship analysis. Based on direct record matching strategies, such as those described by the Fellegi-Sunter Model, new theoretical frameworks are evolving to describe ER processes and outcomes that include other types of inferred and asserted reference linking techniques. Businesses have long recognized that the quality of their ER processes directly impacts the overall value of their information assets and the quality of the information products they produce. Government agencies and departments, including law enforcement and the intelligence community, are increasing their use of ER as a tool for accomplishing their missions as well. Recognizing the growing interest in ER theory and practice, and its impact on information quality in organizations, the ACM Journal of Data and Information Quality (JDIQ) will devote a special issue to innovative and high-quality research papers in this area. Papers that address any aspect of entity resolution are welcome.

September 28, 2010

Mining Billion-node Graphs: Patterns, Generators and Tools

Filed under: Authoring Topic Maps,Data Mining,Graphs,Software,Subject Identity — Patrick Durusau @ 9:38 am

Mining Billion-node Graphs: Patterns, Generators and Tools Author: Christos Faloutsos (CMU)

Presentation on the Pegasus (Peta-scale Graph Mining System) project.

If you have large amounts of real world data and need some motivation, take a look at this presentation.

Similarity and Duplicate Detection System for an OAI Compliant Federated Digital Library

Filed under: Duplicates,OAI,Subject Identity — Patrick Durusau @ 5:17 am

Similarity and Duplicate Detection System for an OAI Compliant Federated Digital Library Authors: Haseebulla M. Khan, Kurt Maly and Mohammad Zubair Keywords: OAI – duplicate detection – digital library – federation service

Abstract:

The Open Archives Initiative (OAI) is making it feasible to build high-level services such as a federated search service that harvests metadata from different data providers using the OAI protocol for metadata harvesting (OAI-PMH) and provides a unified search interface. There are numerous challenges in building and maintaining a federation service, and one of them is managing duplicates. Detecting exact duplicates where two records have identical sets of metadata fields is straightforward. The problem arises when two or more records differ slightly due to data entry errors, for example. Many duplicate detection algorithms exist, but are computationally intensive for a large federated digital library. In this paper, we propose an efficient duplication detection algorithm for a large federated digital library like Arc.

The authors discovered that title weight was more important than author weight in the discovery of duplicates, working with a subset of 73 archives totaling 465,440 records. It would be interesting to apply this insight to a resource like WorldCat, where duplicates are a noticeable problem.
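A sketch of that weighting insight (the similarity function and the exact weights are illustrative, not the paper’s):

    from difflib import SequenceMatcher

    WEIGHTS = {"title": 0.7, "creator": 0.3}  # title dominates, per the authors

    def sim(x, y):
        return SequenceMatcher(None, x.lower(), y.lower()).ratio()

    def duplicate_score(rec1, rec2):
        return sum(w * sim(rec1[f], rec2[f]) for f, w in WEIGHTS.items())

    r1 = {"title": "Similarity and Duplicate Detection System",
          "creator": "Khan, H."}
    r2 = {"title": "Similarity and duplicate detection system.",
          "creator": "Khan, Haseebulla M."}
    print(duplicate_score(r1, r2))  # near 1.0 -> likely duplicates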

September 27, 2010

Employing Publically Available Biological Expert Knowledge from Protein-Protein Interaction Information

Filed under: Bioinformatics,Biomedical,Subject Identity — Patrick Durusau @ 7:18 pm

Employing Publically Available Biological Expert Knowledge from Protein-Protein Interaction Information Authors: Kristine A. Pattin, Jiang Gui, Jason H. Moore Keywords: GWAS – SNPs – Protein-protein interaction – Epistasis

Abstract:

Genome wide association studies (GWAS) are now allowing researchers to probe the depths of common complex human diseases, yet few have identified single sequence variants that confer disease susceptibility. As hypothesized, this is due to the fact that multiple interacting factors influence clinical endpoint. Given that the number of single nucleotide polymorphism (SNP) combinations grows exponentially with the number of SNPs being analyzed, computational methods designed to detect these interactions in smaller datasets are thus not applicable. Providing statistical expert knowledge has exhibited an improvement in their performance, and we believe biological expert knowledge to be as capable. Since one of the strongest demonstrations of the functional relationship between genes is protein-protein interactions, we present a method that exploits this information in genetic analyses. This study provides a step towards utilizing expert knowledge derived from public biological sources to assist computational intelligence algorithms in the search for epistasis.

Applying human knowledge “…to assist computational intelligence algorithms…,” sounds like subject identity and topic maps to me!
