Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 7, 2010

A Library Case For Topic Maps

Filed under: Classification,Examples,Subject Identity,Topic Maps — Patrick Durusau @ 7:40 pm

Libraries would benefit from topic maps in a number of ways but I ran across a very specific one today.

To escape the paralyzing grip of library vendors, a number of open source library system projects, even state-wide library software projects, are now underway.

OK, so you have a central registry of all the books. But the local libraries have millions of books with call numbers already assigned.

Libraries can either spend years and millions of dollars transitioning to uniform identifiers (doesn’t that sound “webby” to you?) or they can keep the call numbers they have.

Here is a real-life example of the call numbers for Everybody’s Plutarch:

920 PLU

920.3 P

920 PLUT

R 920 P

920 P

Solution? One record (can you say proxy?) for this book with details for the various individual library holdings.

Libraries are already doing this so what is the topic map payoff?

Say I write a review of Everybody’s Plutarch and post it to the local library system with call number 920 P.

With a topic map, users of the system with 920.3 P (or any of the others) will also see my review.
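To make the proxy idea concrete, here is a minimal sketch. The in-memory registry and class are invented for illustration, not any particular library system:

```python
# Sketch: one subject proxy per book, many local identifiers (call numbers).
class BookProxy:
    def __init__(self, title):
        self.title = title
        self.call_numbers = set()   # local identifiers, one per library
        self.reviews = []           # contributions attach to the proxy

registry = {}  # call number -> proxy

def register(proxy, call_number):
    proxy.call_numbers.add(call_number)
    registry[call_number] = proxy

plutarch = BookProxy("Everybody's Plutarch")
for cn in ["920 PLU", "920.3 P", "920 PLUT", "R 920 P", "920 P"]:
    register(plutarch, cn)

# A review posted under "920 P"...
registry["920 P"].reviews.append("A fine edition.")

# ...is visible to a user whose library uses "920.3 P".
print(registry["920.3 P"].reviews)  # ['A fine edition.']
```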

The topic map payoff is that we can benefit from the contributions of others as well as contribute ourselves.

(Without having to move in mental lock step.)

Open Provenance Model

Filed under: Ontology,RDF,Semantic Web — Patrick Durusau @ 11:37 am

A spate of provenance ontology materials landed in my inbox today:

  1. Open Provenance Model Ontology (OPMO)
  2. Open Provenance Model Vocabulary (OPMV)
  3. Open Provenance Model (OPM)
  4. Provenance Vocabulary Mappings

We should count ourselves fortunate that the W3C working group did not title their document: Open Provenance Model Vocabulary Mappings.

The community would be better served with less clever and more descriptive naming.

No doubt the Open Provenance Model Vocabulary (#2 above) has some range of materials in mind.

I don’t know the presumed target but some candidates come to mind:

  • Art Museum Open Provenance Model (including looting/acquisition terms)
  • Library Open Provenance Model
  • Natural History Open Provenance Model
  • ….

I am, of course, giving the authors the benefit of the doubt in presuming their intent was not to create a universal model of provenance.

For topic map purposes, the Provenance Vocabulary Mappings document (#4 above) is the most interesting. Read through it and then answer the questions below.

Questions:

  1. Assume you have yet another provenance vocabulary. On what basis would you map it to any of the other vocabularies discussed in #4?
  2. Most of the mappings in #4 give a rationale. How is that (if it is) different from properties and merging rules for topic maps?
  3. What should we do with mappings in #4 or elsewhere that don’t give a rationale?
  4. How should we represent rationales for mappings? Is there some alternative not considered by topic maps?

Summarize your thoughts in 3-5 pages for all four questions. They are too interrelated to answer separately. You can use citations if you like but these aren’t questions answered in the literature. Or, well, at least I don’t find any of the answers in the literature convincing. 😉 Your experience may vary.
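As a concrete handle on question 2, here is a minimal sketch of mappings that carry their own rationale, so a processor can skip any mapping that lacks one. The term names on the right-hand side and the skip policy are invented for illustration:

```python
# Sketch: vocabulary mappings as data, each carrying its rationale.
mappings = [
    {"from": "opmv:wasGeneratedBy", "to": "other:producedBy",
     "relation": "equivalent",
     "rationale": "Both relate an artifact to the process that produced it."},
    {"from": "opmv:used", "to": "other:consumed",
     "relation": "narrower",
     "rationale": None},   # question 3: a mapping with no stated rationale
]

def usable(mapping):
    # One policy: refuse to act on mappings that give no rationale.
    return mapping["rationale"] is not None

for m in filter(usable, mappings):
    print(m["from"], "->", m["to"], "because", m["rationale"])
```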

United States of Autocomplete – Post

Filed under: Interface Research/Design,Visualization — Patrick Durusau @ 7:08 am

United States of Autocomplete, also via Flowing Data.

Deeply amusing visualization of autocompletion of search terms by the Google search engine.

Most data manipulation/visualization focuses on some data set.

What if those techniques were turned inward?

That is to say what if data manipulation/visualization choices were visualized? Either in terms of the choices they make or the processes they use?

I suspect that is a largely unexplored area of visualization and one where, given the differences in terminology, topic maps could be helpful.

Questions:

  1. What data manipulation/visualization choices would you suggest for visualization? What about them do you think could be visualized? (3-5 pages, citations)
  2. How should we go about designing a visualization for #1? What is it that we wish to illustrate? (3-5 pages, citations)
  3. How would we compare that visualization to other visualizations? In topic map terms, what are the subjects and how do we identify them? (3-5 pages, citations)

PS: One additional thought. What if we had a data set and compared two techniques, manipulation or visualization, on the basis of how they treated the data? What data was lost or not included by one technique but retained by another? And does it vary by the nature of the data set?

Could be useful for topic map engineers trying to decide on the best tools for a particular job. I am sure that sort of information is generally available in data mining books or in stories about data mining exercises, but being able to visualize the same could be quite useful.

Visualizing Similarities

Filed under: Interface Research/Design,Visualization — Patrick Durusau @ 6:14 am

Flowing Data reports on Similarities Between PhD Dissertations

From Flowing Data:

Certain fields of study tend to cover many of the same topics. Many times, the two fields go hand-in-hand. Electrical engineering, for example, ties tightly with computer science. Same thing between education and sociology. Daniel Ramage and Jason Chuang of Stanford University explore these similarities through the language used in their school’s dissertations.

Topic distance is explored between departments from 1993 to 2008. Select a department and it goes to the middle of the circle. Departments with dissertations that were similar for the year are highlighted and drawn closer. The closer to the center, the more similar. Alternatively, you should also be able to find departments who are fairly different from the rest.

You can also use the time slider on the bottom to check out how some departments grow closer to another over time.

Questions:

  1. Interesting visualization but how would you suggest learning more about the basis for “similarity?” (3-5 pages, no citations)
  2. More generally, how would you go about evaluating claims of “similarity?” Would you compare different measures or use some other methodology? (3-5 pages, no citations)
  3. Can you suggest other visualizations of “similarity?” (1-3 pages, citations/hyperlinks)

December 6, 2010

KissKissBan

KissKissBan: A Competitive Human Computation Game for Image Annotation
Authors: Chien-Ju Ho, Tao-Hsuan Chang, Jong-Chuan Lee, Jane Yung-jen Hsu, Kuan-Ta Chen
Keywords: Amazon Mechanical Turk, ESP Game, Games With A Purpose, Human Computation, Image Annotation

Abstract:

In this paper, we propose a competitive human computation game, KissKissBan (KKB), for image annotation. KKB is different from other human computation games since it integrates both collaborative and competitive elements in the game design. In a KKB game, one player, the blocker, competes with the other two collaborative players, the couples; while the couples try to find consensual descriptions about an image, the blocker’s mission is to prevent the couples from reaching consensus. Because of its design, KKB possesses two nice properties over the traditional human computation game. First, since the blocker is encouraged to stop the couples from reaching consensual descriptions, he will try to detect and prevent coalition between the couples; therefore, these efforts naturally form a player-level cheating-proof mechanism. Second, to evade the restrictions set by the blocker, the couples would endeavor to bring up a more diverse set of image annotations. Experiments hosted on Amazon Mechanical Turk and a gameplay survey involving 17 participants have shown that KKB is a fun and efficient game for collecting diverse image annotations.

This article makes me wonder: what about the use of “games” for the construction of topic maps?

I don’t know of any theoretical reason why topic map construction has to resemble a visit to the dentist’s office. 😉

Or for that matter, why does a user need to know they are authoring/using a topic map at all?

Questions:

  1. What other game or game-like scenarios do you think lend themselves to the creation of online content? (3-5 pages, citations)
  2. What type of information do you think users could usefully contribute to a topic map (whether known to be a topic map or not)? (3-5 pages, no citations)
  3. Sketch out a proposal for an online game that adds information, focusing on incentives and the information contributed. (3-5 pages, no citations)

Survey on Social Tagging Techniques

Filed under: Bookmarking,Classification,Folksonomy,Tagging — Patrick Durusau @ 6:37 am

Survey on Social Tagging Techniques
Authors: Manish Gupta, Rui Li, Zhijun Yin, Jiawei Han
Keywords: Social tagging, bookmarking, tagging, social indexing, social classification, collaborative tagging, folksonomy, folk classification, ethnoclassification, distributed classification, folk taxonomy

Abstract:

Social tagging on online portals has become a trend now. It has emerged as one of the best ways of associating metadata with web objects. With the increase in the kinds of web objects becoming available, collaborative tagging of such objects is also developing along new dimensions. This popularity has led to a vast literature on social tagging. In this survey paper, we would like to summarize different techniques employed to study various aspects of tagging. Broadly, we would discuss about properties of tag streams, tagging models, tag semantics, generating recommendations using tags, visualizations of tags, applications of tags and problems associated with tagging usage. We would discuss topics like why people tag, what influences the choice of tags, how to model the tagging process, kinds of tags, different power laws observed in tagging domain, how tags are created, how to choose the right tags for recommendation, etc. We conclude with thoughts on future work in the area.

I recommend this survey in part due to its depth but also for not lacking a viewpoint:

…But fixed static taxonomies are rigid, conservative, and centralized. [cite omitted]…Hierarchical classifications are influenced by the cataloguer’s view of the world and, as a consequence, are affected by subjectivity and cultural bias. Rigid hierarchical classification systems cannot easily keep up with an increasing and evolving corpus of items…By their very nature, hierarchies tend to establish only one consistent, authoritative structured vision. This implies a loss of precision, erases differences of expression, and does not take into account the variety of user needs and views.

I am not innocent of having made similar arguments in other contexts. It makes good press among the young and dissatisfied, but it doesn’t bear up under close scrutiny.

For example, the claim is made that “hierarchical classifications” are “affected by subjectivity and cultural bias.” The implied claim is that social tagging is not. Yes? I would argue that all classification, hierarchical and otherwise, is affected by “subjectivity and cultural bias.”

Questions:

  1. Choose one of the other claims about hierarchical classifications. Is it also true of social tagging? Why/Why not? (3-5 pages, no citations)
  2. Choose a social tagging practice. What are its strengths/weaknesses? (3-5 pages, no citations)
  3. How would you use topic maps with the social tagging practice in #2? (3-5 pages, no citations)

A Brief Survey on Sequence Classification

Filed under: Data Mining,Pattern Recognition,Sequence Classification,Subject Identity — Patrick Durusau @ 5:56 am

A Brief Survey on Sequence Classification
Authors: Zhengzheng Xing, Jian Pei, Eamonn Keogh

Abstract:

Sequence classification has a broad range of applications such as genomic analysis, information retrieval, health informatics, finance, and abnormal detection. Different from the classification task on feature vectors, sequences do not have explicit features. Even with sophisticated feature selection techniques, the dimensionality of potential features may still be very high and the sequential nature of features is difficult to capture. This makes sequence classification a more challenging task than classification on feature vectors. In this paper, we present a brief review of the existing work on sequence classification. We summarize the sequence classification in terms of methodologies and application domains. We also provide a review on several extensions of the sequence classification problem, such as early classification on sequences and semi-supervised learning on sequences.

Excellent survey article on sequence classification, which as the authors note, is a rapidly developing field of research.

This article was published in the “newsletter” of the ACM Special Interest Group on Knowledge Discovery and Data Mining. Far more substantive material than I am accustomed to seeing in any “newsletter.”

The ACM has very attractive student discounts and if you are serious about being an information professional, it is one of the organizations that I would recommend in addition to the usual library suspects.

to_be_classified: A Facet Analysis of a Folksonomy

Filed under: Classification,Facets,Folksonomy,Ranganathan — Patrick Durusau @ 5:37 am

to_be_classified: A Facet Analysis of a Folksonomy
Author: Elise Conradi
Keywords: Facet analysis, Faceted classification, VDP::Samfunnsvitenskap: 200::Biblioteks- og informasjonsvitenskap: 320::Kunnskapsgjenfinning og organisering: 323

Abstract:

This research examines Ranganathan’s postulational approach to facet analysis with the intention of manually inducing a faceted classification ontology from a folksonomy. Folksonomies are viewed as a source to a wealth of data representing users’ perspectives. An in-depth study of faceted classification theory is used to form a methodology based on the postulational approach. The dataset used to test the methodology consists of over 107,000 instances of 1,275 unique tags representing 76 popular non-fiction history books collected from the LibraryThing folksonomy. Preliminary results of the facet analysis indicate the manual inducement of two faceted classification ontologies in the dataset; one representing the universe of books and one representing the universe of subjects within the universe of books. The ontology representing the universe of books is considered to be complete, whereas the ontology representing the universe of subjects is incomplete. These differences are discussed in light of theoretical differences between special and universal faceted classifications. The induced ontologies are then discussed in terms of their substantiation or violation of Ranganathan’s Canons of Classification.

Highly recommended. Expect back references to this entry in the coming months.

Questions:

  1. Is Ranganathan’s “idea plane” for work in classification different from Husserl’s “bracketing?” If so, how? (3-5 pages, citations)
  2. How would you distinguish the “idea plane” from the “verbal plane?” (3-5 pages, no citations)
  3. How would you compare the “idea planes” as seen by two different classifiers? (3-5 pages, no citations)

GT.M High end TP database engine

Filed under: Data Structures,GT.M,node-js,NoSQL,Software — Patrick Durusau @ 4:55 am

GT.M High end TP database engine (Sourceforge)

Description from the commercial version:

The GT.M data model is a hierarchical associative memory (i.e., multi-dimensional array) that imposes no restrictions on the data types of the indexes and the content – the application logic can impose any schema, dictionary or data organization suited to its problem domain. GT.M’s compiler for the standard M (also known as MUMPS) scripting language implements full support for ACID (Atomic, Consistent, Isolated, Durable) transactions, using optimistic concurrency control and software transactional memory (STM) that resolves the common mismatch between databases and programming languages. Its unique ability to create and deploy logical multi-site configurations of applications provides unrivaled continuity of business in the face of not just unplanned events, but also planned events, including planned events that include changes to application logic and schema.

There are clients for node-js:

http://github.com/robtweed/node-mdbm
http://github.com/robtweed/node-mwire

Local topic map software is interesting and useful but for scaling to the enterprise level, something different is going to be required.

Reports of implementing the TMDM or other topic map legends with a GT.M based system are welcome.
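In the meantime, here is a hedged sketch of how TMDM-ish constructs might be laid out as hierarchical keys, GT.M-global style, with a plain Python dict standing in for the global store. The key layout is invented, not a published legend:

```python
# Sketch: TMDM-ish constructs as hierarchical keys, GT.M-global style.
# In M this might be ^topic("t1","name","en")="Plutarch"; a dict of
# subscript tuples stands in for the global here.
store = {}

def gset(*args):
    *subscripts, value = args
    store[tuple(subscripts)] = value

gset("topic", "t1", "subjectIdentifier", "0", "http://example.org/plutarch")
gset("topic", "t1", "name", "en", "Plutarch")
gset("assoc", "a1", "type", "wrote")
gset("assoc", "a1", "role", "author", "t1")

# A prefix scan, the moral equivalent of $ORDER over a subtree:
for key in sorted(store):
    if key[:2] == ("topic", "t1"):
        print(key, "=", store[key])
```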

December 5, 2010

Amazon Web Services

Filed under: Software,Topic Map Software — Patrick Durusau @ 8:27 pm

Amazon Web Services

The recent Wikileaks story drew my attention to the web services offered by Amazon. I knew they were there but had not really paid as much attention as I should have.

I don’t know the details, but be aware that there is a one-year free service tier to introduce new users to the cloud.

Curious if anyone is already offering topic map services with a provider like Amazon Web Services?

Subject identity management as a service seems like a likely commodity in the cloud.

Data sets may expose different identity APIs, as it were, depending upon the degree of access required.

International Conference on Biomedical Ontology

Filed under: Bioinformatics,Biomedical,Conferences — Patrick Durusau @ 8:17 pm

International Conference on Biomedical Ontology
Buffalo, NY, July 26-30, 2011

February 1st: Deadline for workshop and tutorial proposals
March 1st: Deadline for papers

Call for Paper Details

Emphasis on:

  • Techniques and technologies for collaborative ontology development
  • Reasoning with biomedical ontologies
  • Evaluation of biomedical ontologies
  • Biomedical ontology and the Semantic Web
Ontologies for:

  • Biomedical imaging
  • Biochemistry and drug discovery
  • Biomedical investigations, experimentation, clinical trials
  • Clinical and translational research
  • Development and anatomy
  • Electronic health records
  • Evolution and phylogeny
  • Metagenomics
  • Neuroscience, psychiatry, cognition

Questions:

  1. What role (if any) do you see for topic maps in biomedical ontology development, review or use? (3-5 pages, no citations)
  2. Choose a biomedical ontology or some aspect of its use and describe how you would apply a topic map to it. (3-5 pages, citations)
  3. How would you use a topic map to assist in the creation of a biomedical ontology? (3-5 pages, citations)

SIMCOMP: A Hybrid Soft Clustering of Metagenome Reads

Filed under: Bioinformatics,Biomedical,Subject Identity — Patrick Durusau @ 6:54 pm

SIMCOMP: A Hybrid Soft Clustering of Metagenome Reads
Authors: Shruthi Prabhakara, Raj Acharya

Abstract:

A major challenge facing metagenomics is the development of tools for the characterization of functional and taxonomic content of vast amounts of short metagenome reads. In this paper, we present a two pass semi-supervised algorithm, SimComp, for soft clustering of short metagenome reads, that is a hybrid of comparative and composition based methods. In the first pass, a comparative analysis of the metagenome reads against BLASTx extracts the reference sequences from within the metagenome to form an initial set of seeded clusters. Those reads that have a significant match to the database are clustered by their phylogenetic provenance. In the second pass, the remaining fraction of reads are characterized by their species-specific composition based characteristics. SimComp groups the reads into overlapping clusters, each with its read leader. We make no assumptions about the taxonomic distribution of the dataset. The overlap between the clusters elegantly handles the challenges posed by the nature of the metagenomic data. The resulting cluster leaders can be used as an accurate estimate of the phylogenetic composition of the metagenomic dataset. Our method enriches the dataset into a small number of clusters, while accurately assigning fragments as small as 100 base pairs.

I cite this article for the proposition that subject identity may be a multi-pass thing. 😉

Seriously, as topic maps spread out we are going to encounter any number of subject identity practices that don’t involve string matching.

Not only do we need a passing familiarity with them, but also the flexibility to incorporate the user’s expectations about subject identity into our topic maps.
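By way of analogy, here is a minimal sketch of a two-pass subject identity test outside bioinformatics: the first pass seeds clusters on a strong identifier, the second assigns the leftovers by similarity. The records, tokenizer, and threshold are all invented for illustration:

```python
# Sketch: two-pass subject identity resolution, SimComp-style.
# Pass 1: an exact match on a strong identifier seeds the clusters.
# Pass 2: leftovers join the most similar seeded cluster, if similar enough.
def tokens(s):
    return set(s.lower().split())

records = [
    {"id": "doi:10.1000/x", "text": "Plutarch parallel lives"},
    {"id": "doi:10.1000/x", "text": "parallel lives, Plutarch"},
    {"id": None, "text": "plutarch lives parallel translation"},
    {"id": None, "text": "gardening for beginners"},
]

clusters = {}
for r in records:                                  # pass 1
    if r["id"]:
        clusters.setdefault(r["id"], []).append(r)

for r in records:                                  # pass 2
    if r["id"]:
        continue
    best, score = None, 0.0
    for key, members in clusters.items():
        seed, cand = tokens(members[0]["text"]), tokens(r["text"])
        jaccard = len(seed & cand) / len(seed | cand)
        if jaccard > score:
            best, score = key, jaccard
    if best is not None and score >= 0.4:          # invented threshold
        clusters[best].append(r)
    else:
        clusters[id(r)] = [r]                      # stays its own subject

for key, members in clusters.items():
    print(key, [m["text"] for m in members])
```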

Questions:

  1. Search on the phrase “metagenomic analysis software”.
  2. Become familiar with any one of the software packages listed.
  3. Of the techniques used by the software in #2, which one would you use in another context and why? (3-5 pages, no citations)

PS: I realize that some students have little or no interest in bioinformatics. The important lesson is learning to generalize the application of a technique in one area to its application in apparently dissimilar areas.

idk (I Don’t Know) – Ontology, Semantic Web – Cablegate

Filed under: Associations,Ontology,Roles,Semantic Web,Subject Identity,Topic Maps — Patrick Durusau @ 4:45 pm

While researching the idk (I Don’t Know) post I ran across the suggestion that unknown was not appropriate for an ontology:

Good principles of ontological design state that terms should represent biological entities that actually exist, e.g., functional activities that are catalyzed by enzymes, biological processes that are carried out in cells, specific locations or complexes in cells, etc. To adhere to these principles the Gene Ontology Consortium has removed the terms, biological process unknown ; GO:0000004, molecular function unknown ; GO:0005554 and cellular component unknown ; GO:0008372 from the ontology.

The “unknown” terms violated this principle of sound ontological design because they did not represent actual biological entities but instead represented annotation status. Annotations to “unknown” terms distinguished between genes that were curated when no information was available and genes that were not yet curated (i.e., not annotated). Annotation status is now indicated by annotating to the root nodes, i.e. biological_process ; GO:0008150, molecular_function ; GO:0003674, or cellular_component ; GO:0005575. These annotations continue to signify that a given gene product is expected to have a molecular function, biological process, or cellular component, but that no information was available as of the date of annotation.

Adhering to principles of correct ontology design should allow GO users to take advantage of existing tools and reasoning methods developed by the ontological community. (http://www.geneontology.org/newsletter/archive/200705.shtml, 5 December 2010)

I wonder what the restriction to “…entities that actually exist” means.

If a leak of documents occurs, a leaker exists, but in a topic map, I would say that was a role, not an individual.

If the unknown person is represented as an annotation to a role, how do I annotate such an annotation with information about the unknown/unidentified leaker?

Being unknown, I don’t think we can get that with an ontology, at least not directly.

Suggestions?

PS: A topic map can represent unknown functions, etc., as first class subjects (using topics) for an appropriate use case.

idk (I Don’t Know)

Filed under: Subject Identity,TMDM,Topic Maps,Uncategorized,XTM — Patrick Durusau @ 1:10 pm

What are you using to act as the placeholder for an unknown player of a role?

That is, in say a news, crime, or accident investigation, there is an association with specified roles, but only some facts are known, not the identities of all the players.

For example, in the recent cablegate case, when the story of the leaks broke, there was clearly an association between the leaked documents and the leaker.

The leaker had a number of known characteristics, not the least of which was ready access to a wide range of documents. I am sure there were others.

To investigate that leak with a topic map, I would want to have a representative for the player of that role, to which I can assign properties.

I started to publish a subject identifier for the subject idk (I Don’t Know) to act as that placeholder but then thought it needs more discussion.

This has been in my blog queue for a couple of weeks so another week or so before creating a subject identifier won’t hurt.

The problem, which you already spotted, is that TMDM-governed topic maps are going to merge topics with the idk (I Don’t Know) subject identifier. That would be incorrect in many cases.
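To make the hazard concrete, a minimal sketch of TMDM-style merging on shared subject identifiers: two distinct unknown players collapse into one topic the moment both carry the idk identifier. The identifier URI is a stand-in, not a published one:

```python
# Sketch: naive TMDM-style merging on a shared subject identifier.
IDK = "http://example.org/idk"   # placeholder identifier, not yet published

topics = [
    {"ids": {IDK}, "facts": {"case": "cablegate leaker"}},
    {"ids": {IDK}, "facts": {"case": "unrelated bank leaker"}},
]

def merge_on_identifiers(topics):
    merged = []
    for t in topics:
        for m in merged:
            if m["ids"] & t["ids"]:      # any shared identifier => one topic
                m["ids"] |= t["ids"]
                m["facts"].update(t["facts"])
                break
        else:
            merged.append({"ids": set(t["ids"]), "facts": dict(t["facts"])})
    return merged

# Two different unknown players collapse into one topic -- wrong here:
print(len(merge_on_identifiers(topics)))  # 1
```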

Interesting that it would not be wrong in all cases. That is, I could have two associations, both of which have idk (I Don’t Know) subject identifiers, and I want them to merge on the basis of other properties. So in that case the merge should occur.

I am leaning towards simply defining the semantics to be non-merger in the absence of merger on some other specified basis.

Suggestions?

PS: I kept writing the expansion idk (I Don’t Know) because a popular search engine suggested Insane Dutch Killers as the expansion. Wanted to avoid any ambiguity.

d.note: revising user interfaces through change tracking, annotations, and alternatives

Filed under: Authoring Topic Maps,Interface Research/Design — Patrick Durusau @ 8:22 am

d.note: revising user interfaces through change tracking, annotations, and alternatives
Authors: Björn Hartmann, Sean Follmer, Antonio Ricciardi, Timothy Cardenas, Scott R. Klemmer

Abstract:

Interaction designers typically revise user interface prototypes by adding unstructured notes to storyboards and screen printouts. How might computational tools increase the efficacy of UI revision? This paper introduces d.note, a revision tool for user interfaces expressed as control flow diagrams. d.note introduces a command set for modifying and annotating both appearance and behavior of user interfaces; it also defines execution semantics so proposed changes can be tested immediately. The paper reports two studies that compare production and interpretation of revisions in d.note to freeform sketching on static images (the status quo). The revision production study showed that testing of ideas during the revision process led to more concrete revisions, but that the tool also affected the type and number of suggested changes. The revision interpretation study showed that d.note revisions required fewer clarifications, and that additional techniques for expressing revision intent could be beneficial. (There is a movie that accompanies this article as well.)

Designing/revising user interfaces is obviously relevant to the general task of creating topic maps software.

Questions:

  1. Pick a current topic map authoring tool and evaluate its user interface. (3-5 pages, no citations)
  2. Create a form for authoring topic map material in a particular domain.
  3. What are the strong/weak points of your proposal in #2? (3-5 pages, no citations)

December 4, 2010

Exploring Homology Using the Concept of Three-State Entropy Vector

Filed under: Bioinformatics,Biomedical,Data Mining — Patrick Durusau @ 3:24 pm

Exploring Homology Using the Concept of Three-State Entropy Vector
Authors: Armando J. Pinho, Sara P. Garcia, Paulo J. S. G. Ferreira, Vera Afreixo, Carlos A. C. Bastos, António J. R. Neves, João M. O. S. Rodrigues
Keywords: DNA signature, DNA coding regions, DNA entropy, Markov models

Abstract:

The three-base periodicity usually found in exons has been used for several purposes, as for example the prediction of potential genes. In this paper, we use a data model, previously proposed for encoding protein-coding regions of DNA sequences, to build signatures capable of supporting the construction of meaningful dendograms. The model relies on the three-base periodicity and provides an estimate of the entropy associated with each of the three bases of the codons. We observe that the three entropy values vary among themselves and also from species to species. Moreover, we provide evidence that this makes it possible to associate a three-state entropy vector with each species, and we show that similar species are characterized by similar three-state entropy vectors.

I include this paper both as informative for the bioinformatics crowd and to illustrate that subject identity tests are as varied as the subjects they identify. In this particular case, the identification of species for the construction of dendrograms.
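For the curious, a sketch of the flavor of the computation: Shannon entropy of the base distribution at each of the three codon positions. The paper’s model is Markov-based, so this zero-order version on a toy sequence is only illustrative:

```python
# Sketch: per-codon-position base entropy on a toy sequence; the paper's
# model is finite-context (Markov), so this zero-order version is only
# the flavor of a three-state entropy vector.
from collections import Counter
from math import log2

def three_state_entropy(seq):
    vector = []
    for pos in range(3):                    # codon positions 1..3
        bases = seq[pos::3]
        counts = Counter(bases)
        n = len(bases)
        h = -sum((c / n) * log2(c / n) for c in counts.values())
        vector.append(round(h, 3))
    return vector

coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"
print(three_state_entropy(coding))          # one value per codon position
```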

Probabilistic User Modeling in the Presence of Drifting Concepts

Probabilistic User Modeling in the Presence of Drifting Concepts
Authors: Vikas Bhardwaj, Ramaswamy Devarajan

Abstract:

We investigate supervised prediction tasks which involve multiple agents over time, in the presence of drifting concepts. The motivation behind choosing the topic is that such tasks arise in many domains which require predicting human actions. An example of such a task is recommender systems, where it is required to predict the future ratings, given features describing items and context along with the previous ratings assigned by the users. In such a system, the relationships among the features and the class values can vary over time. A common challenge to learners in such a setting is that this variation can occur both across time for a given agent, and also across different agents, (i.e. each agent behaves differently). Furthermore, the factors causing this variation are often hidden. We explore probabilistic models suitable for this setting, along with efficient algorithms to learn the model structure. Our experiments use the Netflix Prize dataset, a real world dataset which shows the presence of time variant concepts. The results show that the approaches we describe are more accurate than alternative approaches, especially when there is a large variation among agents. All the data and source code would be made open-source under the GNU GPL.

Interesting because not only do concepts drift from user to user but modeling users as existing in neighborhoods of other users was more accurate than purely homogeneous or heterogeneous models.

Questions:

  1. If there is a “neighborhood” effect on users, what, if anything, does that imply for co-occurrence of terms? (3-5 pages, no citations)
  2. How would you determine “neighborhood” boundaries for terms? (3-5 pages, citations)
  3. Do “neighborhoods” for terms vary by semantic domains? (3-5 pages, citations)

*****
Be aware that the Netflix dataset is no longer available. Possibly in response to privacy concerns. A demonstration of the utility of such concerns and their advocates.

Zoie: Real-time search indexing

Filed under: Full-Text Search,Indexing,Lucene,Search Engines,Software — Patrick Durusau @ 10:04 am

Zoie: Real-time search indexing

Somehow appropriate that following the lead on Kafka led me to Zoie (and other goodies to be reported).

From the website:

Zoie is a real-time search and indexing system built on Apache Lucene.

Donated by LinkedIn.com on July 19, 2008, and has been deployed in a real-time large-scale consumer website: LinkedIn.com handling millions of searches as well as hundreds of thousands of updates daily.

News: Zoie 2.0.0 is released … – Compatible with Lucene 2.9.x.

In a real-time search/indexing system, a document is made available as soon as it is added to the index. This functionality is especially important to time-sensitive information such as news, job openings, tweets etc.

Design Goals:

  • Additions of documents must be made available to searchers immediately
  • Indexing must not affect search performance
  • Additions of documents must not fragment the index (which hurts search performance)
  • Deletes and/or updates of documents must not affect search performance.

In topic map terms:

  • Additions to a topic map must be made available to searchers immediately
  • Indexing must not affect search performance
  • Additions to a topic map must not fragment the index (which hurts search performance)
  • Deletes and/or updates of a topic map must not affect search performance.

I would say that #’s 3 and 4 are research questions at this point.

Additions, updates and deletions in a topic map may have unforeseen (unforeseeable?) consequences.

Such as causing:

  • merging to occur
  • merging to be undone
  • roles to be played
  • roles to not be played
  • associations to become valid
  • associations to become invalid

to name only a few.
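A sketch of the first item, with union-find standing in for a topic map engine: a single added topic carrying two identifiers forces two previously distinct topics, and everything indexed under them, to merge. The identifiers are invented:

```python
# Sketch: one addition forcing a cascade merge (union-find as a stand-in
# for a topic map engine; the identifiers are invented).
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]       # path halving
        x = parent[x]
    return x

def add_topic(identifiers):
    ids = list(identifiers)
    for i in ids[1:]:                       # all of a topic's ids co-refer
        parent[find(i)] = find(ids[0])

add_topic(["isbn:123"])
add_topic(["lccn:456"])
print(find("isbn:123") == find("lccn:456"))   # False: two distinct topics

# A new topic carrying both identifiers arrives; the two old topics,
# and every index entry under them, must now be reconciled:
add_topic(["isbn:123", "lccn:456"])
print(find("isbn:123") == find("lccn:456"))   # True: cascade merge
```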

It may be possible to formally prove the impact that certain events will have but I am not aware of any definitive analysis on the subject.

Suggestions?

Kafka : A high-throughput distributed messaging system – Post

Filed under: Software,Topic Map Systems — Patrick Durusau @ 5:54 am

Kafka : A high-throughput distributed messaging system

Caught my eye:

Kafka is a distributed publish-subscribe messaging system. It is designed to support the following

  • Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
  • High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.
  • Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
  • Support for parallel data load into Hadoop.

Depending on your message passing requirements for your topic map application, this could be of interest. Better to concentrate on the semantic heavy lifting than re-inventing message passing when solutions like this exist.
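For instance, a merge event makes a natural message. A minimal sketch, assuming the kafka-python client (a later client than the 2010 codebase); the broker address, topic name, and event shape are all illustrative:

```python
# Sketch: publishing topic map change events through Kafka. Assumes the
# kafka-python client (pip install kafka-python) and a broker at
# localhost:9092; topic name and event shape are illustrative.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A merge in the topic map becomes an event other consumers can replay:
producer.send("topicmap-changes", {
    "op": "merge",
    "surviving": "topic-42",
    "absorbed": "topic-99",
})
producer.flush()
```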

December 3, 2010

Dynamic Indexes?

I was writing the post about the New York Times graphics presentation when it occurred to me how close we are to dynamic indexes.

After all, gaming consoles are export restricted.

What we now consider to be “runs,” static indexes and the like are computational artifacts.

They follow how we created indexes when they were done by hand.

What happens when the properties of what is being indexed, its identifications and merging rules, can change on the fly and the index re-presents itself to the user for further manipulation?

I don’t think the fundamental issues of index construction get any easier with dynamic indexes but how we answer them will determine how quickly we can make effective use of such indexes.

Whether crossing the line first to dynamic indexes will be a competitive advantage, only time will tell.

I would like for some VC to be interested in finding out.

Caveat to VCs. If someone pitches this as making indexes more quickly, that isn’t the point. “Quick” and “dynamic” aren’t the same thing. Related but different. Keep both hands on your wallet.

Data Visualization Practices at the New York Times – Post

Filed under: Graphics,Interface Research/Design,Visualization — Patrick Durusau @ 5:06 pm

Data Visualization Practices at the New York Times

Amanda Cox of the New York Times’ graphics department recently gave a great presentation to the New Media Days conference in Copenhagen and described how the Times uses data visualizations to reveal patterns, provide context, describe relationships, and even create a sense of wonder about the world.

Great post!

Detecting “Duplicates” (same subject?)

Filed under: Authoring Topic Maps,Duplicates,String Matching,Subject Identity — Patrick Durusau @ 4:43 pm

A couple of interesting posts from the LingPipe blog:

Processing Tweets with LingPipe #1: Search and CSV Data Structures

Processing Tweets with LingPipe #2: Finding Duplicates with Hashing and Normalization

The second one on duplicates being the one that caught my eye.

After all, what are merging conditions in the TMDM other than the detection of duplicates?

Of course, I am interested in TMDM merging but also in the detection of fuzzy subject identity.

Whether that is then represented by an IRI or kept as a native merging condition is an implementation-type issue.
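In the spirit of the LingPipe posts, a minimal sketch of normalize-then-hash duplicate detection. The normalization rules are invented; real pipelines use heavier tokenization:

```python
# Sketch: finding near-duplicate tweets by normalizing, then hashing.
import hashlib
import re
from collections import defaultdict

def normalize(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)    # drop URLs
    text = re.sub(r"[^a-z0-9 ]", " ", text)     # drop punctuation
    return " ".join(sorted(text.split()))       # order-insensitive

tweets = [
    "Cable leak: read it here http://ex.org/a",
    "READ IT HERE: cable leak!!",
    "Completely different message",
]

buckets = defaultdict(list)
for t in tweets:
    key = hashlib.md5(normalize(t).encode()).hexdigest()
    buckets[key].append(t)

for group in buckets.values():
    if len(group) > 1:
        print("same subject?", group)   # the first two land together
```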

This could be very important for some future leak of diplomatic tweets. 😉

NoSQL Data Modeling

Filed under: Authoring Topic Maps,Database,Topic Maps — Patrick Durusau @ 4:06 pm

NoSQL Data Modeling

Alex Popescu emphasizes that data modeling is part and parcel of NoSQL database design.

Data modeling practice has something that topic maps practice does not: a wealth of material on data model patterns.

Rather I should say: subject identification patterns (which subjects to identify) and subject identity patterns (how to identify those subjects).

Both of which if developed and written out, could help with the topic map authoring process.

Constructions From Dots And Lines

Filed under: Data Structures,Graphs — Patrick Durusau @ 9:47 am

Constructions From Dots And Lines
Authors: Marko A. Rodriguez, Peter Neubauer

Abstract:

A graph is a data structure composed of dots (i.e. vertices) and lines (i.e. edges). The dots and lines of a graph can be organized into intricate arrangements. The ability for a graph to denote objects and their relationships to one another allow for a surprisingly large number of things to be modeled as a graph. From the dependencies that link software packages to the wood beams that provide the framing to a house, most anything has a corresponding graph representation. However, just because it is possible to represent something as a graph does not necessarily mean that its graph representation will be useful. If a modeler can leverage the plethora of tools and algorithms that store and process graphs, then such a mapping is worthwhile. This article explores the world of graphs in computing and exposes situations in which graphical models are beneficial.

A relatively short (11 pages) and entertaining (it takes all kinds you know) treatment of graphs and their properties.

The depiction of the types of graphs and the possibility of combining types of graphs in figure 2 is alone worth downloading the article for, but please read it in full.
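As a warm-up for the questions below, a minimal sketch of a topic map fragment as a directed, labeled property graph, with vertices for topics and associations and edges carrying the roles played. The labels are invented for illustration:

```python
# Sketch: a topic map fragment as a directed, labeled property graph.
vertices = {
    "plutarch": {"kind": "topic", "name": "Plutarch"},
    "lives":    {"kind": "topic", "name": "Parallel Lives"},
    "wrote":    {"kind": "association", "name": "authorship"},
}

# Edges carry the role played, which a plain (unlabeled) graph cannot say:
edges = [
    ("wrote", "plutarch", {"role": "author"}),
    ("wrote", "lives", {"role": "work"}),
]

for src, dst, props in edges:
    print(vertices[src]["name"], "--" + props["role"] + "->", vertices[dst]["name"])
```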

Questions:

  1. Evaluate the suitability of the graph types for representing topic maps.
  2. Which graph type looks like the best fit for topic maps? (3-5 pages, no citations)
  3. Which graph type looks like the worst fit for topic maps? (3-5 pages, no citations)

Neo4j 1.2 Milestone 5 – Reference Manual and HA! – Post (Portents for TMQL?)

Filed under: Graphs,Neo4j,Query Language,TMQL — Patrick Durusau @ 9:27 am

Neo4J 1.2 Milestone 5 – Reference Manual and HA!

News of the release of a reference manual for Neo4j and a High Availability option (the HA in the title).

I know it is a reference manual but I was disappointed there was no mention of topic maps.

Surprising I know but it still happens. 😉

Guess I need to try to find the cycles to generate, collaborate on, etc., some documentation that can be posted to the topic maps community for review.

Assuming it passes muster there, it can be passed along to the Neo4j project.

BTW, I found a “related” article listed for Neo4j that starts off:

A multi-relational graph maintains two or more relations over a vertex set. This article defines an algebra for traversing such graphs that is based on an $n$-ary relational algebra, a concatenative single-relational path algebra, and a tensor-based multi-relational algebra. The presented algebra provides a monoid, automata, and formal language theoretic foundation for the construction of a multi-relational graph traversal engine.

Can’t you just hear Robert saying that with a straight face? 😉

Seriously, if we are going to compete with enterprise grade solutions, that is the level of thinking that needs to underlie TMQL.

It is going to require effort on all our parts but “good enough” solutions aren’t and should not be supported.

Declared Instance Inferences (DI2)? (RDF, OWL, Semantic Web)

Filed under: Inference,OWL,RDF,Semantic Web,Subject Identity — Patrick Durusau @ 8:49 am

In recent discussions of identity, I have seen statements that OWL reasoners could infer that two or more representatives stood for the same subject.

That’s useful, but I wondered if the inferencing overhead is necessary in all such cases.

If a user recognizes that a subject representative (a subject proxy in topic map terms) represents the same subject as another representative, a declarative statement avoids the need for artificial inferencing.

I am sure there are cases where inferencing is useful, particularly to suggest inferences to users, but declared inferences could reduce that need and the overhead.

Declarative information artifacts could be created that contain rules for known identifications.

For example, gene names found in PubMed. If two or more names are declared to refer to the same gene, where is the need for inferencing?

With such declarations in place, no reasoner has to “infer” anything about those names.
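A sketch of the gene name case: declared co-reference as a plain equivalence table, so resolution is a lookup rather than a reasoning step. The gene names and canonicalization rule are invented:

```python
# Sketch: declared co-reference as a plain equivalence table; resolving
# two gene names is a lookup, not a reasoning step. Names are invented.
DECLARED = [
    {"TP53", "p53", "tumor protein 53"},    # declared: the same gene
    {"BRCA1", "breast cancer 1"},           # declared: the same gene
]

canon = {}
for group in DECLARED:
    rep = sorted(group)[0]                  # pick a canonical member
    for name in group:
        canon[name.lower()] = rep

def same_gene(a, b):
    ca, cb = canon.get(a.lower()), canon.get(b.lower())
    return ca is not None and ca == cb

print(same_gene("p53", "TP53"))             # True, by declaration alone
print(same_gene("p53", "BRCA1"))            # False
```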

Declared instance inferences (DI2) reduce semantic dissonance, inferencing overhead and uncertainty.

Looks like a win-win situation to me.

*****
PS: It occurs to me that ontologies are also “declared instance inferences” upon which artificial reasoners rely. The instances happen to be classes and not individuals.

S4

S4

From the website:

S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.

Just in case you were wondering if topic maps are limited to being bounded objects composed of syntax. No.
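A minimal sketch of the unbounded case, with a generator standing in for an S4-style event source and subject proxies accreting as events arrive. The event shape and the identity rule (match on handle) are invented:

```python
# Sketch: subject proxies accreting over an unbounded stream. A generator
# stands in for an S4-style event source; the event shape and the
# identity rule (match on handle) are invented.
def stream():
    yield {"handle": "@leaker", "text": "first message"}
    yield {"handle": "@analyst", "text": "hello"}
    yield {"handle": "@leaker", "text": "second message"}
    # ...in production this never ends

proxies = {}
for event in stream():
    proxy = proxies.setdefault(event["handle"], {"occurrences": []})
    proxy["occurrences"].append(event["text"])

for handle, proxy in proxies.items():
    print(handle, proxy["occurrences"])
```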

Questions:

  1. Specify three sources of unbounded streams of data. (3 pages, citations)
  2. What subjects would you want to identify and on what basis in any one of them? (3-5 pages, citations)
  3. What other information about those subjects would you want to bind to the information in #2? What subject identity tests are used for those subjects in other sources? (5-10 pages, citations)

December 2, 2010

Building Concept Structures/Concept Trails

Automatically Building Concept Structures and Displaying Concept Trails for the Use in Brainstorming Sessions and Content Management Systems Authors: Christian Biemann, Karsten Böhm, Gerhard Heyer and Ronny Melz

Abstract:

The automated creation and the visualization of concept structures become more important as the number of relevant information continues to grow dramatically. Especially information and knowledge intensive tasks are relying heavily on accessing the relevant information or knowledge at the right time. Moreover the capturing of relevant facts and good ideas should be focused on as early as possible in the knowledge creation process.

In this paper we introduce a technology to support knowledge structuring processes already at the time of their creation by building up concept structures in real time. Our focus was set on the design of a minimal invasive system, which ideally requires no human interaction and thus gives the maximum freedom to the participants of a knowledge creation or exchange processes. The initial prototype concentrates on the capturing of spoken language to support meetings of human experts, but can be easily adapted for the use in Internet communities that have to rely on knowledge exchange using electronic communication channel.

I don’t share the authors’ confidence that corpus linguistics is going to provide the level of accuracy expected.

But, I find the notion of a dynamic semantic map that grows, changes and evolves during a discussion to be intriguing.

This article was published in 2006 so I will follow up to see what later results have been reported.
