Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 18, 2011

Hadoop Basics – Post

Filed under: Hadoop,MapReduce,Subject Identity — Patrick Durusau @ 9:27 pm

Hadoop Basics by Carlo Scarioni illustrates the basics of using Hadoop.

When you read the blog post you will know why I selected his post over any number of others.
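
If you have not yet run a MapReduce job, here is a minimal sketch, in plain Python rather than the Hadoop API, of the map/shuffle/reduce pattern that Hadoop automates, assuming a word-count style exercise:

```python
# Plain-Python sketch of map -> shuffle -> reduce; not the Hadoop API itself.
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Emit (word, 1) for every word in a line of input.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Sum the counts emitted for a single word.
    return word, sum(counts)

def run(lines):
    # The "shuffle" step: group mapper output by key before reducing.
    groups = defaultdict(list)
    for word, count in chain.from_iterable(mapper(line) for line in lines):
        groups[word].append(count)
    return dict(reducer(word, counts) for word, counts in groups.items())

print(run(["to be or not to be", "to map or not to map"]))
# {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'map': 2}
```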

Questions:

  1. Perform the exercise and examine the results. How accurate are they?
  2. How would you improve the accuracy?
  3. How would you have to modify the Hadoop example to use your improvements in #2?

January 15, 2011

How to Think about Parallel Programming: Not!

Filed under: Language Design,Parallel Programming,Subject Identity — Patrick Durusau @ 5:04 pm

How to Think about Parallel Programming: Not! by Guy Steele is a deeply interesting presentation on how not to approach parallel programming. The central theme is that languages should provide parallelism transparently, without programmers having to think in parallel.

Parallel processing of topic maps is another way to scale topic maps for particular situations.

How to parallel process questions of subject identity is an open and possibly domain specific issue.
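
As a toy illustration (my example, not Steele's) of why subject identity is a candidate for this kind of treatment: pairwise identity tests over a set of proxies are independent of one another, so they can be expressed as a plain mapping over pairs. In the sketch below the parallelism is still spelled out with concurrent.futures, which is exactly the bookkeeping Steele argues the language should take off our hands; the proxies and the email-based identity test are invented.

```python
# Minimal sketch: independent pairwise identity tests farmed out to worker processes.
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations

PROXIES = {
    "p1": {"name": "J. Smith", "email": "jsmith@example.com"},
    "p2": {"name": "John Smith", "email": "jsmith@example.com"},
    "p3": {"name": "Jane Doe", "email": "jdoe@example.com"},
}

def same_subject(pair):
    # Toy identity test: a shared email address implies the same subject.
    (id_a, a), (id_b, b) = pair
    return id_a, id_b, a["email"] == b["email"]

if __name__ == "__main__":
    pairs = list(combinations(PROXIES.items(), 2))
    with ProcessPoolExecutor() as pool:
        for id_a, id_b, same in pool.map(same_subject, pairs):
            print(id_a, id_b, "same subject" if same else "different subjects")
```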

Watch the presentation even if you are only seeking an entertaining account of his first program.

January 8, 2011

Principal Programming Paradigms

Filed under: Semantics,Subject Identity,Visualization — Patrick Durusau @ 9:35 am

Principal Programming Paradigms is an outgrowth of Concepts, Techniques, and Models of Computer Programming, which is highly recommended for your reading pleasure.

I don’t think there are any universal threads running through semantic technologies but that may simply be my view of those technologies.

Questions:

  1. List, with definitions, the semantic technologies you would include (3-5 pages, citations)
  2. What characteristics do you think form useful classifications of those technologies? (3-5 pages, citations)
  3. How would you visualize those technologies? (neatness counts)

January 4, 2011

Algorithms – Lecture Notes

Filed under: CS Lectures,String Matching,Subject Identity — Patrick Durusau @ 7:51 am

Algorithms, Jeff Erickson’s lecture notes.

Mentioned in a post on the Theoretical Computer Science blog, What Lecture Notes Should Everyone Read?

From the introduction:

Despite several rounds of revision, these notes still contain lots of mistakes, errors, bugs, gaffes, omissions, snafus, kludges, typos, mathos, grammaros, thinkos, brain farts, nonsense, garbage, cruft, junk, and outright lies, all of which are entirely Steve Skiena’s fault. I revise and update these notes every time I teach the course, so please let me know if you find a bug. (Steve is unlikely to care.)

The notes are highly amusing and useful to anyone seeking to improve current subject identification (read searching) practices.

January 3, 2011

Vowpal Wabbit (Fast Online Learning)

Filed under: Machine Learning,Subject Identity — Patrick Durusau @ 4:01 pm

Vowpal Wabbit (Fast Online Learning) by John Langford.

From the website:

There are two ways to have a fast learning algorithm: (a) start with a slow algorithm and speed it up, or (b) build an intrinsically fast learning algorithm. This project is about approach (b), and it’s reached a state where it may be useful to others as a platform for research and experimentation.

I rather like that!

Suspect the same is true for subject identity recognition algorithms.

People have fast ones that require little or no programming. 😉

What it will take to replicate such intrinsically fast subject recognition algorithms in digital form remains a research question.

December 31, 2010

The Brainy Learning Algorithms of Numenta

Filed under: Subject Identity — Patrick Durusau @ 11:59 am

The Brainy Learning Algorithms of Numenta

From the lead:

Jeff Hawkins has a track record at predicting the future. The founder of Palm and inventor of the PalmPilot, he spent the 1990s talking of a coming world in which we would all carry powerful computers in our pockets. “No one believed in it back then—people thought I was crazy,” he says. “Of course, I’m thrilled about how successful mobile computing is today.”

At his current firm, Numenta, Hawkins is working on another idea that seems to come out of left field: copying the workings of our own brains to build software that makes accurate snap decisions for today’s data-deluged businesses. He and his team have been working on their algorithms since 2005 and are finally preparing to release a version that is ready to be used in products. Numenta’s technology is aimed at a variety of applications, such as judging whether a credit card transaction is fraudulent, anticipating what a Web user will click next, or predicting the likelihood that a particular hospital patient will suffer a relapse.

In topic maps lingo, I would say the algorithms are developing parameters for subject recognition.

It would be interesting to see the development of parameters for subject recognition that could be sold or leased as artifacts separate from the software.

As far as I know, all searching software develops its own view from scratch, which seems pretty wasteful. Not to mention obtaining results of varying utility.
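
A minimal sketch, with entirely invented fields and rules, of what a separately tradable recognition artifact might look like: the parameters travel as data, and the software only applies them.

```python
# Hypothetical "recognition parameters as an artifact": the rules live in a data
# file that can be shipped, sold, or leased apart from the software applying them.
import json, re

ARTIFACT = json.dumps({
    "subject": "credit-card-fraud-candidate",   # invented subject and rules
    "rules": [
        {"field": "amount", "op": "gt", "value": 5000},
        {"field": "merchant", "op": "regex", "value": r"(?i)casino|wire\s*transfer"},
    ],
})

def matches(record, artifact_json):
    # Apply every rule in the artifact; all must hold for a match.
    for rule in json.loads(artifact_json)["rules"]:
        value = record.get(rule["field"])
        if rule["op"] == "gt" and not (value is not None and value > rule["value"]):
            return False
        if rule["op"] == "regex" and not re.search(rule["value"], str(value or "")):
            return False
    return True

print(matches({"amount": 9000, "merchant": "Lucky Casino"}, ARTIFACT))  # True
print(matches({"amount": 12, "merchant": "Grocery Mart"}, ARTIFACT))    # False
```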

Questions:

  1. Are there any search engines or appliances that don’t start from scratch when indexing? (1-2 pages, citations)
  2. What issues do current search engines present to the addition of subject recognition rules, data or results (from other software)? (3-5 pages, citations)
  3. What would you add to current search engines? Rules, results from other engines? Why? (3-5 pages, citations)

December 30, 2010

Inductive Logic Programming (and Martian Identifications)

Filed under: Bayesian Models,Inductive Logic Programming (ILP),Subject Identity — Patrick Durusau @ 4:44 pm

Inductive Logic Programming: Theory and Methods Authors: Stephen Muggleton, Luc De Raedt

Abstract:

Inductive Logic Programming (ILP) is a new discipline which investigates the inductive construction of first-order clausal theories from examples and background knowledge. We survey the most important theories and methods of this new field. Firstly, various problem specifications of ILP are formalised in semantic settings for ILP, yielding a “model-theory” for ILP. Secondly, a generic ILP algorithm is presented. Thirdly, the inference rules and corresponding operators used in ILP are presented, resulting in a “proof-theory” for ILP. Fourthly, since inductive inference does not produce statements which are assured to follow from what is given, inductive inferences require an alternative form of justification. This can take the form of either probabilistic support or logical constraints on the hypothesis language. Information compression techniques used within ILP are presented within a unifying Bayesian approach to confirmation and corroboration of hypotheses. Also, different ways to constrain the hypothesis language, or specify the declarative bias are presented. Fifthly, some advanced topics in ILP are addressed. These include aspects of computational learning theory as applied to ILP, and the issue of predicate invention. Finally, we survey some applications and implementations of ILP. ILP applications fall under two different categories: firstly scientific discovery and knowledge acquisition, and secondly programming assistants.

A good survey of Inductive Logic Programming (ILP), if a bit dated. Feel free to suggest more recent surveys of the area.

As I mentioned under Mining Travel Resources on the Web Using L-Wrappers, the notion of interpretative domains is quite interesting.

I suspect, but cannot prove (at least at this point), that most useful mappings exist between closely related interpretative domains.

Closely related interpretative domains being composed of identifications of a subject that I will quickly recognize as alternative identifications.

Showing me a mapping that includes a Martian identification of my subject, which is not a closely related interpretative domain is unlikely to be useful, at least to me. (I can’t speak for any potential Martians.)

Graph 500

Filed under: Graphs,Subject Identity,Topic Maps — Patrick Durusau @ 11:56 am

Graph 500

From the website:

Data intensive supercomputer applications are increasingly important HPC workloads, but are ill-suited for platforms designed for 3D physics simulations. Current benchmarks and performance metrics do not provide useful information on the suitability of supercomputing systems for data intensive applications. A new set of benchmarks is needed in order to guide the design of hardware architectures and software systems intended to support such applications and to help procurements. Graph algorithms are a core part of many analytics workloads.

Backed by a steering committee of over 30 international HPC experts from academia, industry, and national laboratories, Graph 500 will establish a set of large-scale benchmarks for these applications. The Graph 500 steering committee is in the process of developing comprehensive benchmarks to address three application kernels: concurrent search, optimization (single source shortest path), and edge-oriented (maximal independent set). Further, we are in the process of addressing five graph-related business areas: Cybersecurity, Medical Informatics, Data Enrichment, Social Networks, and Symbolic Networks.

This is the first serious approach to complement the Top 500 with data intensive applications. Additionally, we are working with the SPEC committee to include our benchmark in their CPU benchmark suite. We anticipate the list will rotate between ISC and SC in future years.

What drew my attention to this site was the following quote in the IEEE article, Better Benchmarking for Supercomputers by Mark Anderson:

An “edge” here is a connection between two data points. For instance, when you buy Michael Belfiore’s Department of Mad Scientists from Amazon.com, one edge is the link in Amazon’s computer system between your user record and the Department of Mad Scientists database entry. One necessary but CPU-intensive job Amazon continually does is to draw connections between edges that enable it to say that 4 percent of customers who bought Belfiore’s book also bought Alex Abella’s Soldiers of Reason and 3 percent bought John Edwards’s The Geeks of War.

That is intensive enough within Amazon’s own system, but what if someone, say the U.S. government, wanted to map Amazon’s data to information it holds in various systems?
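
A minimal sketch, with invented identifiers and data, of that mapping problem: co-purchase edges in one system connect to records in another system only if the two systems can agree that their identifiers name the same subject.

```python
# Invented data: edges from a "store" system and records from an external system.
co_purchases = [("user:4711", "book:mad-scientists"),
                ("user:4711", "book:soldiers-of-reason")]

watch_list = {"passport:X123": {"name": "A. Reader"}}

# The identity mapping is the expensive, easily wrong part: which store user is
# which passport holder?  Here it is simply declared.
same_subject = {"user:4711": "passport:X123"}

for user, item in co_purchases:
    external = same_subject.get(user)
    if external in watch_list:
        print(f"{watch_list[external]['name']} ({external}) bought {item}")
```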

Can you say subject identity?

December 26, 2010

XML Schema Element Similarity Measures: A Schema Matching Context

Filed under: Similarity,Subject Identity,Topic Maps — Patrick Durusau @ 4:33 pm

XML Schema Element Similarity Measures: A Schema Matching Context Authors: Alsayed Algergawy, Richi Nayak, Gunter Saake

Abstract:

In this paper, we classify, review, and experimentally compare major methods that are exploited in the definition, adoption, and utilization of element similarity measures in the context of XML schema matching. We aim at presenting a unified view which is useful when developing a new element similarity measure, when implementing an XML schema matching component, when using an XML schema matching system, and when comparing XML schema matching systems.

I commend the entire paper for your reading but would draw your attention to one of the conclusions in particular:

Using a single element similarity measure is not sufficient to assess the similarity between XML schema elements. This necessitates the need to utilize several element measures exploiting both internal element features and external element relationships.

Does it seem plausible that a single subject similarity measure can work, but that it is better to use several?
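
A minimal sketch, not the paper’s method, of the conclusion quoted above: combine an internal (name-based) measure with an external (structural, children-based) measure into one weighted score.

```python
# Combining two element similarity measures into a single weighted score.
from difflib import SequenceMatcher

def name_sim(a, b):
    # Internal measure: string similarity of the element names.
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def structure_sim(a, b):
    # External measure: Jaccard overlap of child element names.
    ca, cb = set(a["children"]), set(b["children"])
    return len(ca & cb) / len(ca | cb) if ca | cb else 0.0

def combined_sim(a, b, w_name=0.6, w_struct=0.4):
    return w_name * name_sim(a, b) + w_struct * structure_sim(a, b)

author  = {"name": "Author",  "children": ["FirstName", "LastName"]}
writer  = {"name": "Writer",  "children": ["FirstName", "LastName", "Bio"]}
invoice = {"name": "Invoice", "children": ["Amount", "Date"]}

print(round(combined_sim(author, writer), 2))   # higher: similar names and children
print(round(combined_sim(author, invoice), 2))  # lower: little overlap on either measure
```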

Questions:

  1. Compare this paper to any recent (last two years) paper on database schema similarity. What issues are the same, different, similar? (sorry, could not think of another word for it) (2-3 pages, citations)
  2. Create an annotated bibliography of ten (10) recent papers on XML or database schema similarity (excluding the papers in #1). (4-6 pages, citations)
  3. How would you use any of the similarity measures you have read about in a topic map? Or is similarity enough? (3-5 pages, no citations)

December 18, 2010

KNIME Version 2.3.0 released – News

Filed under: Heterogeneous Data,Mapping,Software,Subject Identity — Patrick Durusau @ 12:48 pm

KNIME Version 2.3.0 released

From the announcement:

The new version is greatly enhancing the usability of KNIME. It adds new features like workflow annotations, support for hotkeys, inclusion of R-views in reports, data flow switches, option to hide node labels, variable support in the database reader/connector and R-nodes, and the ability to export KNIME workflows as SVG Graphics.

With the 2.3 release we are also introducing a community node repository, which includes KNIME extensions for bio- and chemoinformatics and an advanced R-scripting environment.

Data trails reconstruction at the community level in the Web of data – Presentation

Filed under: Co-Words,Data Mining,Subject Identity — Patrick Durusau @ 9:30 am

David Chavalarias: Video from SOKS: Self-Organising Knowledge Systems, Amsterdam, 29 April 2010.

Abstract:

Socio-semantic networks continuously produce data over the Web in a time consistent manner. From scientific communities publishing new findings in archives to citizens confronting their opinions in blogs, there is a real challenge to reconstruct, at the community level, the data trails they produce in order to have a global representation of the topics unfolding in these public arena. We will present such methods of reconstruction in the framework of co-word analysis, highlighting perspectives for the development of innovative tools for our daily interactions with their productions.

I wasn’t able to get very good sound quality for this presentation and there were no slides. However, I was interested enough to find the author’s home page: David Chavalarias and a wealth of interesting material.

I will be watching his projects for some very interesting results and suggest that you do the same.

December 14, 2010

NKE: Navigational Knowledge Engineering

Filed under: Authoring Topic Maps,Ontology,Subject Identity,Topic Maps — Patrick Durusau @ 5:36 pm

NKE: Navigational Knowledge Engineering

From the website:

Although structured data is becoming widely available, no other methodology – to the best of our knowledge – is currently able to scale up and provide light-weight knowledge engineering for a massive user base. Using NKE, data providers can publish flat data on the Web without extensively engineering structure upfront, but rather observe how structure is created on the fly by interested users, who navigate the knowledge base and at the same time also benefit from using it. The vision of NKE is to produce ontologies as a result of users navigating through a system. This way, NKE reduces the costs for creating expressive knowledge by disguising it as navigation. (emphasis in original)

This methodology may or may not succeed but it demonstrates a great deal of imagination.

Now imagine a similar concept but built around subject identity.

Where known ambiguities offer a user a choice of subjects to identify.

Or where there are different ways to identify a subject. The harder case.

Questions:

  1. Read the paper/run the demo. Comments, suggestions? (3-5 pages, no citations)
  2. How would you adapt this approach to the identification of subjects? (3-5 pages, no citations)
  3. What data set would you suggest for a test case using the technique you describe in #2? Why is that data set a good test? (3-5 pages, pointers to the data set)

December 13, 2010

10×10 – Words and Photos

Filed under: Data Source,Dataset,Subject Identity — Patrick Durusau @ 7:38 am

10×10

From the website:

10×10 (‘ten by ten’) is an interactive exploration of words and photos based on news stories during a particular hour. The 10×10 site displays 100 photos, each photo representative of a word used in many news stories published during the current hour. The 10×10 site maintains an archive of these photos and words back to 2004. The 10×10 API is organized like directories, with the year, month, day and hour. Retrieve the words list for a particular hour, then get the photos that correspond to those words.

A preset mapping of words to photos, but nothing would prevent an application from offering additional photos.

Not to mention enabling merging based on the recognition of photos.*

Replication of merging could be an issue if based on image recognition.

On the other hand, I am not sure replication of merging would be any less certain than asking users to base merging decisions on textual content.

Reliable replication of merging is possible only when our mechanical servants are given rules to apply.

****
*Leaving aside replication of merging issues (which may not be an operational requirement), facial recognition, perhaps supplemented by human operator confirmation, could be an interesting component of mass acquisition of images, say at border entry/exit points.

Not that border personnel need be given access to such information, a la Secret Intelligence – Public Recording Network (SIPRNet) systems, but a topic map could simply signal an order to detain, follow, get their phone number.

Simply dumping data into systems doesn’t lead to more “connect the dot” moments.

Topic maps may be a way to lead to more such moments, depending upon the skill of their construction and your analysts. (inquiries welcome)

December 10, 2010

Facets and “Undoable” Merges

After writing Identifying Subjects with Facets, I started thinking about the merge of the subjects matching a set of facets, so that the user could observe all the associations in which the members of that subject participated.

If merger is a matter of presentation to the user, then the user should be able to remove one of the members that makes up a subject from the merge, which results in the removal of the associations in which that member of the subject participated.

No more or less difficult than the inclusion/exclusion based on the facets, except this time it involves removal on the basis of roles in associations. That is, the playing of a role, being a role, etc., are treated as facets of a subject.

Well, except that an individual member of a collective subject is being manipulated.

This capability would enable a user to manipulate what members of a subject are represented in a merge. Not to mention being able to unravel a merge one member of a subject at a time.

An effective visual representation of such a capability could be quite stunning.

Identifying Subjects With Facets

If facets are aspects of subjects, then for every group of facets, I am identifying the subject that has those facets.

If I have the facets, height, weight, sex, age, street address, city, state, country, email address, then at the outset, my subject is the subject that has all those characteristics, with whatever value.

We could call that subject: people.

Not the way I usually think about it but follow the thought out a bit further.

For each facet where I specify a value, the subject identified by the resulting value set is both different from the starting subject and, more importantly, has a smaller set of members in the data set.

Members that make up the collective that is the subject we have identified.

Assume we have narrowed the set of people down to a group subject that has ten members.

Then, we select merge from our application and it merges these ten members.

Sounds damned odd, to merge what we know are different subjects?

What if by merging those different members we can now find these different individuals have a parent association with the same children?

Or have a contact relationship with a phone number associated with an individual or group of interest?
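
A minimal sketch, with invented members and associations, of the sequence just described: facet values identify a set of members, the merge is a presentation over their associations, and removing a member unwinds exactly that member’s contribution.

```python
# Facet-based identification followed by a presentation-level, undoable merge.
people = {
    "m1": {"city": "Oslo",   "age": 34},
    "m2": {"city": "Oslo",   "age": 34},
    "m3": {"city": "Bergen", "age": 51},
}
associations = [
    ("parent-of",  "m1", "child:anna"),
    ("parent-of",  "m2", "child:anna"),
    ("contact-of", "m3", "phone:555-0100"),
]

def identify(facets):
    # The subject is the set of members sharing the specified facet values.
    return {m for m, props in people.items()
            if all(props.get(k) == v for k, v in facets.items())}

def merged_view(members):
    # "Merging" here is presentational: show every association any member plays.
    return [a for a in associations if a[1] in members]

subject = identify({"city": "Oslo", "age": 34})
print(merged_view(subject))           # both parent-of associations surface
print(merged_view(subject - {"m2"}))  # undo one member: its associations drop out
```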

Robust topic map applications will offer users the ability to navigate and explore subject identities.

Subject identities that may not always be the ones you expect.

We don’t live in a canned world. Does your semantic software?

December 9, 2010

Foundations of Computer Science

Filed under: Subject Identity,TMRM,Topic Maps — Patrick Durusau @ 11:55 am

Foundations of Computer Science

Introduction to theory in computer science by Alfred V. Aho and Jeffrey D. Ullman. (Free PDF of the entire text)

The turtle on the cover is said to be a reference to the turtle on which the world rests.

This particular turtle serves as the foundation for:

I point out this work because of its emphasis on abstraction.

Topic maps, at their best, are abstractions that bridge other abstractions and make use of information recorded in those abstractions.

*****
PS: The “rules of thumb” for programming in the introduction are equally applicable to writing topic maps. You will not encounter many instances of them being applied but they remain good guidance.

December 8, 2010

Aspects of Topic Maps

Writing about Bobo: Fast Faceted Search With Lucene, made me start to think about the various aspects of topic maps.

Authoring of topic maps was never discussed in the original HyTime-based topic map standard, and despite several normative syntaxes, even now it is mostly a matter of either you have a topic map or you don’t, depending upon your legend.

That is helpful given the unlimited semantics that can be addressed with topic maps, but it looks awfully hand-wavy to, ahem, outsiders.

Subject identity, or should I say: when two subject representatives are deemed, for some purpose, to represent the same subject. (That’s clearer. ;-)) This lies at the heart of topic maps, and the rest of the paradigm supports, or is a consequence of, this principle.

There is no one way to identify any subject, and users should be free to use the identification that suits them best, where subjects include the data structures that we build for users. Yes, IT doesn’t get to dictate what subjects can be identified or how. (That probably should never have been the case, but that is another issue.)

Merging of subject representatives. Merging is an aspect of recognizing that two or more subject representatives represent the same subject. What happens then is implementation, data model and requirement specific.

A user may wish to see separate representatives just prior to merger so merging can be audited or may wish to see only merged representatives for some subset of subjects or may have other requirements.

Interchange of topic maps. Not exclusively the domain of syntaxes/data models but an important purpose for them. It is entirely possible to have topic maps for which no interchange is intended or desirable. Rumor has it that there are topic maps at the Y-12 facility at Oak Ridge, for example. Interchange was not their purpose.

Navigation of the topic map. The post that provoked this one is a good example. I don’t need specialized or monolithic software to navigate a topic map. It hampers topic map development to suggest otherwise.

Querying topic maps. Topic maps have been slow to develop a query language and that effort has recently re-started. Graph query languages, that are already fairly mature, may be sufficient for querying topic maps.

Given the diversity of subject identity semantics, I don’t foresee a one size fits all topic maps query language.

Interfaces for topic maps. However one resolves/implements other aspects of topic maps, due regard has to be paid to the issue of interfaces. Efforts thus far range from web portals to "look, it’s a topic map!" type interfaces.

In defense of current efforts, human-computer interfaces are poorly understood. Not surprising, since the human-codex interface isn’t completely understood and we have been working at that one considerably longer.

Questions:

  1. What other aspects of topic maps would you list?
  2. Would you sub-divide any of these aspects? If so, how?
  3. What suggestions do you have for one or more of these aspects?

December 7, 2010

Bobo: Fast Faceted Search With Lucene

Filed under: Facets,Information Retrieval,Lucene,Navigation,Subject Identity — Patrick Durusau @ 8:52 pm

Bobo: Fast Faceted Search With Lucene

From the website:

Bobo is a Faceted Search implementation written purely in Java, an extension of Apache Lucene.

While Lucene is good with unstructured data, Bobo fills in the missing piece to handle semi-structured and structured data.

Bobo Browse is an information retrieval technology that provides navigational browsing into a semi-structured dataset. Beyond the result set from queries and selections, Bobo Browse also provides the facets from this point of browsing.

Features:

  • No need for cache warm-up for the system to perform
  • multi value sort – sort documents on fields that have multiple values per doc, e.g. tokenized fields
  • fast field value retrieval – over 30x faster than IndexReader.document(int docid)
  • facet count distribution analysis
  • stable and small memory footprint
  • support for runtime faceting
  • result merge library for distributed facet search

I had to go look up the definition of facet. Merriam-Webster (I remember when it was just Webster) says:

any of the definable aspects that make up a subject (as of contemplation) or an object (as of consideration)

So a faceted search could search/browse, in theory at any rate, based on any property of a subject, even those I don’t recognize.

Different languages being the easiest example.

I could have aspects of a hotel room described in both German and Korean, both describing the same facets of the room.
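
Bobo itself is a Java/Lucene library, so the following is only a minimal plain-Python sketch of the facet count distribution idea: for each facet, how many hits fall under each of its values (the hotel-room data is invented).

```python
# Facet count distribution over a small, invented result set.
from collections import Counter, defaultdict

rooms = [
    {"city": "Berlin", "smoking": "no",  "stars": 4},
    {"city": "Berlin", "smoking": "yes", "stars": 3},
    {"city": "Seoul",  "smoking": "no",  "stars": 4},
]

def facet_counts(hits, facets):
    counts = defaultdict(Counter)
    for hit in hits:
        for facet in facets:
            counts[facet][hit[facet]] += 1
    return {f: dict(c) for f, c in counts.items()}

print(facet_counts(rooms, ["city", "smoking", "stars"]))
# {'city': {'Berlin': 2, 'Seoul': 1}, 'smoking': {'no': 2, 'yes': 1}, 'stars': {4: 2, 3: 1}}
```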

Questions:

  1. How would you choose the facets for a subject to be included in faceted browsing? (3-5 pages, no citations)
  2. How would you design and test the presentation of facets to users? (3-5 pages, no citations)
  3. Compare the current TMQL proposal (post-Barta) with the query language for facet searching. If a topic map were treated (post-merging) as faceted subjects, which one would you prefer and why? (3-5 pages, no citations)

A Library Case For Topic Maps

Filed under: Classification,Examples,Subject Identity,Topic Maps — Patrick Durusau @ 7:40 pm

Libraries would benefit from topic maps in a number of ways but I ran across a very specific one today.

To escape the paralyzing grip of library vendors, a number of open source library system projects, even state-wide library software projects, are now underway.

OK, so you have a central registry of all the books. But the local libraries have millions of books with call numbers already assigned.

Libraries can either spend years and $millions to transition to uniform identifiers (doesn’t that sound “webby” to you?) or they can keep the call numbers they have.

Here is a real life example of the call numbers for Everybody’s Plutarch:

920 PLU

920.3 P

920 PLUT

R 920 P

920 P

Solution? One record (can you say proxy?) for this book with details for the various individual library holdings.

Libraries are already doing this so what is the topic map payoff?

Say I write a review of Everybody’s Plutarch and post it to the local library system with call number 920 P.

With a topic map, users of the system with 920.3 P (or any of the others), will also see my review.

The topic map payoff is that we can benefit from the contributions of others as well as contribute ourselves.

(Without having to move in mental lock step.)
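
A minimal sketch, with invented identifiers, of that payoff: every local call number points at one proxy for the work, so a review posted through any call number is visible through all of them.

```python
# Many local call numbers, one proxy per work; reviews attach to the proxy.
call_number_to_proxy = {
    "920 PLU":  "work:everybodys-plutarch",
    "920.3 P":  "work:everybodys-plutarch",
    "920 PLUT": "work:everybodys-plutarch",
    "R 920 P":  "work:everybodys-plutarch",
    "920 P":    "work:everybodys-plutarch",
}
reviews = {}  # proxy id -> list of review texts

def post_review(call_number, text):
    reviews.setdefault(call_number_to_proxy[call_number], []).append(text)

def reviews_for(call_number):
    return reviews.get(call_number_to_proxy[call_number], [])

post_review("920 P", "A readable introduction to Plutarch.")
print(reviews_for("920.3 P"))  # the review posted under 920 P appears here too
```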

December 6, 2010

KissKissBan

KissKissBan: A Competitive Human Computation Game for Image Annotation Authors: Chien-Ju Ho, Tao-Hsuan Chang, Jong-Chuan Lee, Jane Yung-jen Hsu, Kuan-Ta Chen Keywords: Amazon Mechanical Turk, ESP Game, Games With A Purpose, Human Computation, Image Annotation

Abstract:

In this paper, we propose a competitive human computation game, KissKissBan (KKB), for image annotation. KKB is different from other human computation games since it integrates both collaborative and competitive elements in the game design. In a KKB game, one player, the blocker, competes with the other two collaborative players, the couples; while the couples try to find consensual descriptions about an image, the blocker’s mission is to prevent the couples from reaching consensus. Because of its design, KKB possesses two nice properties over the traditional human computation game. First, since the blocker is encouraged to stop the couples from reaching consensual descriptions, he will try to detect and prevent coalition between the couples; therefore, these efforts naturally form a player-level cheating-proof mechanism. Second, to evade the restrictions set by the blocker, the couples would endeavor to bring up a more diverse set of image annotations. Experiments hosted on Amazon Mechanical Turk and a gameplay survey involving 17 participants have shown that KKB is a fun and efficient game for collecting diverse image annotations.

This article makes me wonder about the use of “games” for the construction of topic maps?

I don’t know of any theoretical reason why topic map construction has to resemble a visit to the dentist office. 😉

Or for that matter, why does a user need to know they are authoring/using a topic map at all?

Questions:

  1. What other game or game-like scenarios do you think lend themselves to the creation of online content? (3-5 pages, citations)
  2. What type of information do you think users could usefully contribute to a topic map (whether known to be a topic map or not)? (3-5 pages, no citations)
  3. Sketch out a proposal for an online game that adds information, focusing on incentives and the information contributed. (3-5 pages, no citations)

A Brief Survey on Sequence Classification

Filed under: Data Mining,Pattern Recognition,Sequence Classification,Subject Identity — Patrick Durusau @ 5:56 am

A Brief Survey on Sequence Classification Authors: Zhengzheng Xing, Jian Pei, Eamonn Keogh

Abstract:

Sequence classification has a broad range of applications such as genomic analysis, information retrieval, health informatics, finance, and abnormal detection. Different from the classification task on feature vectors, sequences do not have explicit features. Even with sophisticated feature selection techniques, the dimensionality of potential features may still be very high and the sequential nature of features is difficult to capture. This makes sequence classification a more challenging task than classification on feature vectors. In this paper, we present a brief review of the existing work on sequence classification. We summarize the sequence classification in terms of methodologies and application domains. We also provide a review on several extensions of the sequence classification problem, such as early classification on sequences and semi-supervised learning on sequences.

Excellent survey article on sequence classification, which as the authors note, is a rapidly developing field of research.

This article was published in the “newsletter” of the ACM Special Interest Group on Knowledge Discovery and Data Mining. Far more substantive material than I am accustomed to seeing in any “newsletter.”

The ACM has very attractive student discounts and if you are serious about being an information professional, it is one of the organizations that I would recommend in addition to the usual library suspects.

December 5, 2010

SIMCOMP: A Hybrid Soft Clustering of Metagenome Reads

Filed under: Bioinformatics,Biomedical,Subject Identity — Patrick Durusau @ 6:54 pm

SIMCOMP: A Hybrid Soft Clustering of Metagenome Reads Authors: Shruthi Prabhakara, Raj Acharya

Abstract:

A major challenge facing metagenomics is the development of tools for the characterization of functional and taxonomic content of vast amounts of short metagenome reads. In this paper, we present a two pass semi-supervised algorithm, SimComp, for soft clustering of short metagenome reads, that is a hybrid of comparative and composition based methods. In the first pass, a comparative analysis of the metagenome reads against BLASTx extracts the reference sequences from within the metagenome to form an initial set of seeded clusters. Those reads that have a significant match to the database are clustered by their phylogenetic provenance. In the second pass, the remaining fraction of reads are characterized by their species-specific composition based characteristics. SimComp groups the reads into overlapping clusters, each with its read leader. We make no assumptions about the taxonomic distribution of the dataset. The overlap between the clusters elegantly handles the challenges posed by the nature of the metagenomic data. The resulting cluster leaders can be used as an accurate estimate of the phylogenetic composition of the metagenomic dataset. Our method enriches the dataset into a small number of clusters, while accurately assigning fragments as small as 100 base pairs.

I cite this article for the proposition that subject identity may be a multi-pass thing. 😉

Seriously, as topic maps spread out we are going to encounter any number of subject identity practices that don’t involve string matching.

Not only do we need a passing familiarity with them, but also the flexibility to incorporate the user’s expectations about subject identity into our topic maps.
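
A minimal sketch, not SimComp itself, of the two-pass idea transplanted to name matching: the first pass seeds clusters by exact match against a reference list, the second assigns the leftovers by a looser similarity test.

```python
# Two-pass identification: exact match first, then fuzzy assignment of leftovers.
from difflib import SequenceMatcher

reference = {"acetylcholinesterase", "hemoglobin"}
reads = ["hemoglobin", "haemoglobin", "acetylcholinesterase", "myoglobin"]

clusters = {name: [] for name in reference}
leftovers = []

# Pass 1: exact matches against the reference list seed the clusters.
for r in reads:
    if r in clusters:
        clusters[r].append(r)
    else:
        leftovers.append(r)

# Pass 2: assign remaining reads to the closest seed, if it is close enough.
for r in leftovers:
    best = max(reference, key=lambda ref: SequenceMatcher(None, r, ref).ratio())
    if SequenceMatcher(None, r, best).ratio() >= 0.9:
        clusters[best].append(r)
    else:
        clusters.setdefault("unassigned", []).append(r)

print(clusters)  # haemoglobin joins the hemoglobin cluster, myoglobin stays unassigned
```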

Questions:

  1. Search on the phrase “metagenomic analysis software”.
  2. Become familiar with any one of the software packages listed.
  3. Of the techniques used by the software in #2, which one would you use in another context and why? (3-5 pages, no citations)

PS: I realize that some students have little or no interest in bioinformatics. The important lesson is learning to generalize the application of a technique in one area to its application in apparently dissimilar areas.

idk (I Don’t Know) – Ontology, Semantic Web – Cablegate

Filed under: Associations,Ontology,Roles,Semantic Web,Subject Identity,Topic Maps — Patrick Durusau @ 4:45 pm

While researching the idk (I Don’t Know) post I ran across the suggestion that unknown was not appropriate for an ontology:

Good principles of ontological design state that terms should represent biological entities that actually exist, e.g., functional activities that are catalyzed by enzymes, biological processes that are carried out in cells, specific locations or complexes in cells, etc. To adhere to these principles the Gene Ontology Consortium has removed the terms, biological process unknown ; GO:0000004, molecular function unknown ; GO:0005554 and cellular component unknown ; GO:0008372 from the ontology.

The “unknown” terms violated this principle of sound ontological design because they did not represent actual biological entities but instead represented annotation status. Annotations to “unknown” terms distinguished between genes that were curated when no information was available and genes that were not yet curated (i.e., not annotated). Annotation status is now indicated by annotating to the root nodes, i.e. biological_process ; GO:0008150, molecular_function ; GO:0003674, or cellular_component ; GO:0005575. These annotations continue to signify that a given gene product is expected to have a molecular function, biological process, or cellular component, but that no information was available as of the date of annotation.

Adhering to principles of correct ontology design should allow GO users to take advantage of existing tools and reasoning methods developed by the ontological community. (http://www.geneontology.org/newsletter/archive/200705.shtml, 5 December 2010)

I wonder what the restriction “…entities that actually exist” means.

If a leak of documents occurs, a leaker exists, but in a topic map, I would say that was a role, not an individual.

If the unknown person is represented as an annotation to a role, how do I annotate such an annotation with information about the unknown/unidentified leaker?

With the leaker being unknown, I don’t think we can get at that with an ontology, at least not directly.

Suggestions?

PS: A topic map can represent unknown functions, etc., as first class subjects (using topics) for an appropriate use case.

idk (I Don’t Know)

Filed under: Subject Identity,TMDM,Topic Maps,Uncategorized,XTM — Patrick Durusau @ 1:10 pm

What are you using to act as the placeholder for an unknown player of a role?

That is, in say a news, crime or accident investigation, there is an association with specified roles, but only some facts are known, not the identities of all the players.

For example, in the recent cablegate case, when the story of the leaks broke, there was clearly an association between the leaked documents and the leaker.

The leaker had a number of known characteristics, not the least of which was ready access to a wide range of documents. I am sure there were others.

To investigate that leak with a topic map, I would want to have a representative for the player of that role, to which I can assign properties.

I started to publish a subject identifier for the subject idk (I Don’t Know) to act as that placeholder but then thought it needs more discussion.

This has been in my blog queue for a couple of weeks so another week or so before creating a subject identifier won’t hurt.

The problem, which you already spotted, is that TMDM-governed topic maps are going to merge topics with the idk (I Don’t Know) subject identifier, which would be incorrect in many cases.

Interesting that it would not be wrong in all cases. That is I could have two associations, both of which have idk (I Don’t Know) subject identifiers and I want them to merge on the basis of other properties. So in that case the subject identifiers should merge.

I am leaning towards simply defining the semantics to be non-merger in the absence of merger on some other specified basis.
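
A minimal sketch, using a toy data model rather than a TMDM implementation and a hypothetical idk identifier, of both halves of the problem: merging on the idk identifier alone is suppressed, while merging on some other shared basis still goes through.

```python
# Toy merging rule: shared subject identifiers trigger a merge, except for those
# declared non-merging (here the hypothetical "idk" identifier).
IDK = "http://example.org/idk"   # hypothetical subject identifier for "I don't know"

topics = [
    {"id": "t1", "identifiers": {IDK}, "note": "leaker in the cablegate association"},
    {"id": "t2", "identifiers": {IDK}, "note": "unknown driver in an accident report"},
]

def should_merge(a, b, non_merging=frozenset({IDK})):
    # Merge on any shared subject identifier *except* those declared non-merging.
    return bool((a["identifiers"] & b["identifiers"]) - non_merging)

print(should_merge(topics[0], topics[1]))            # False: idk alone never merges

# Some later, shared identification (invented) gives another basis for merging.
topics[0]["identifiers"].add("mailto:leaker@example.org")
topics[1]["identifiers"].add("mailto:leaker@example.org")
print(should_merge(topics[0], topics[1]))            # True: merged on that other basis
```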

Suggestions?

PS: I kept writing the expansion idk (I Don’t Know) because a popular search engine suggested Insane Dutch Killers as the expansion. Wanted to avoid any ambiguity.

December 3, 2010

Detecting “Duplicates” (same subject?)

Filed under: Authoring Topic Maps,Duplicates,String Matching,Subject Identity — Patrick Durusau @ 4:43 pm

A couple of interesting posts from the LingPipe blog:

Processing Tweets with LingPipe #1: Search and CSV Data Structures

Processing Tweets with LingPipe #2: Finding Duplicates with Hashing and Normalization

The second one on duplicates being the one that caught my eye.

After all, what are merging conditions in the TMDM other than the detection of duplicates?

Of course, I am interested in TMDM merging but also in the detection of fuzzy subject identity.

Whether that is then represented by an IRI or kept as a native merging condition is an implementation-type issue.
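
A minimal sketch, my example rather than LingPipe’s code, of the normalize-then-hash approach from the second post: collapse the variation you do not care about, and duplicates fall into the same hash bucket.

```python
# Normalize-then-hash duplicate detection over a handful of invented tweets.
import hashlib
import re
from collections import defaultdict

def normalize(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"[^a-z0-9 ]", " ", text)    # drop punctuation
    return " ".join(text.split())              # collapse whitespace

def duplicate_groups(tweets):
    buckets = defaultdict(list)
    for t in tweets:
        key = hashlib.sha1(normalize(t).encode()).hexdigest()
        buckets[key].append(t)
    return [group for group in buckets.values() if len(group) > 1]

tweets = [
    "Cables released!! http://example.org/a",
    "cables released   http://example.org/b",
    "Something else entirely",
]
print(duplicate_groups(tweets))  # the first two normalize to the same string
```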

This could be very important for some future leak of diplomatic tweets. 😉

Declared Instance Inferences (DI2)? (RDF, OWL, Semantic Web)

Filed under: Inference,OWL,RDF,Semantic Web,Subject Identity — Patrick Durusau @ 8:49 am

In recent discussions of identity, I have seen statements that OWL reasoners could infer that two or more representatives stood for the same subject.

That’s useful, but I wondered if the inferencing overhead is necessary in all such cases.

If a user recognizes that a subject representative (a subject proxy in topic map terms) represents the same subject as another representative, a declarative statement avoids the need for artificial inferencing.

I am sure there are cases where inferencing is useful, particularly to suggest inferences to users, but declared inferences could reduce that need and the overhead.

Declarative information artifacts could be created that contain rules for known identifications.

For example, gene names found in PubMed. If two or more names are declared to refer to the same gene, where is the need for inferencing?

With such declarations in place, no reasoner has to “infer” anything about those names.
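
A minimal sketch of a declared instance inference: the identification is recorded once as data (here a simplified gene-synonym table), so nothing has to be re-derived at query time.

```python
# Declared identifications as data; no reasoner needed to connect the names.
DECLARED_SAME = [
    # each set declares that these names refer to the same gene
    {"TP53", "p53", "tumor protein p53"},
    {"BRCA1", "breast cancer 1"},
]

def canonical(name, declarations=DECLARED_SAME):
    for group in declarations:
        if name in group:
            return min(group)   # any deterministic representative of the group will do
    return name

print(canonical("p53") == canonical("tumor protein p53"))  # True, by declaration
print(canonical("BRCA1") == canonical("p53"))              # False
```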

Declared instance inferences (DI2) reduce semantic dissonance, inferencing overhead and uncertainty.

Looks like a win-win situation to me.

*****
PS: It occurs to me that ontologies are also “declared instance inferences” upon which artificial reasoners rely. The instances happen to be classes and not individuals.

November 24, 2010

TMRM and a “universal information space”

Filed under: Subject Identity,TMDM,TMRM,Topic Maps — Patrick Durusau @ 7:58 pm

As an editor of the TMRM (Topic Maps Reference Model) I feel compelled to point out the TMRM is not a universal information space.

I bring up the universal issue because someone lately mentioned mapping to the TMRM.

There is a lot to say about the TMRM but let’s start with the mapping issue.

There is no mapping to the TMRM. (full stop) The reason is that the TMRM is also not a data model. (full stop)

There is a simple reason why the TMRM was not, is not, nor ever will be a data model or universal information space.

There is no universal information space or data model.

Data models are an absolute necessity and more will be invented tomorrow.

But, to be a data model is to govern some larger or smaller slice of data.

We want to meaningfully access information across past, present and future data models in different information spaces.

Enter the TMRM, a model for disclosure of the subjects represented by a data model. Any data model, in any information space.

A model for disclosure, not a methodology, not a target, etc.

We used key and value because a key/value pair is the simplest expression of a property class.

The representative of the definition of a class (the key) and an instance of that class (the value).

That does not constrain or mandate any particular data model or information space.
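
A minimal sketch, not a TMRM implementation, of the key/value point: a proxy is a set of properties, and the disclosure says which keys from each source carry identity.

```python
# Two proxies from different sources, plus per-source disclosures of which keys
# map to which identity-bearing keys.  All names here are invented.
proxy_a = {"given-name": "Johann", "surname": "Bach", "born": 1685}
proxy_b = {"first_name": "Johann", "last_name": "Bach", "birth_year": 1685}

disclosure_a = {"surname": "family-name", "born": "birth-year"}
disclosure_b = {"last_name": "family-name", "birth_year": "birth-year"}

def identity_view(proxy, disclosure):
    # Project the proxy onto the identity-bearing keys named in its disclosure.
    return {shared: proxy[local] for local, shared in disclosure.items()}

print(identity_view(proxy_a, disclosure_a) == identity_view(proxy_b, disclosure_b))
# True: under these disclosures the two proxies represent the same subject
```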

Rather than mapping to the TMRM, we should say mapping using the principles of the TMRM.

I will say more in a later post, but for example, what subject does a topic represent?

With disclosure for the TMDM and RDF, we might not agree on the mapping, but it would be transparent. And useful.

November 19, 2010

All Identifiers, All The Time – LOD As An Answer?

Filed under: Linked Data,LOD,RDA,Semantic Web,Subject Identity — Patrick Durusau @ 6:25 am

I am still musing over Thomas Neidhart’s comment:

To understand this identifier you would need implicit knowledge about the structure and nature of every possible identifier system in existence, and then you still do not know who has more information about it.

Aside from questions of universal identifier systems failing without exception in the past, which makes one wonder why this system should succeed, there are other questions.

Such as why would any system need to encounter every possible identifier system in existence?

That is, the LOD effort has set up a strawman (apologies for the sexism) that it then proceeds to blow down.

If a subject has multiple identifiers in a set and my system recognizes only one out of three, what harm has come of the subject having the other two identifiers?

There is no processing overhead, since by admission the system does not recognize the other identifiers, so it doesn’t process them.

The advantage being that some other system may recognize the subject on the basis of the other identifiers.

This post is a good example of that practice.

I had a category “Linked Data,” but I added a category this morning, “LOD,” just in case people search for it that way.

Why shouldn’t our computers adapt to how we use identifiers (multiple ones for the same subjects) rather than our attempting (and failing) to adapt to universal identifiers to make it easy for our computers?

November 18, 2010

A Direct Mapping of Relational Data to RDF

Filed under: Ambiguity,RDF,Semantic Web,Subject Identity — Patrick Durusau @ 7:15 pm

A Direct Mapping of Relational Data to RDF

A major step towards putting relational data “on the web.”

Identifying what that data means and providing a basis for reconciling it with other data remains to be addressed.

URIs and Identity

Filed under: Ambiguity,RDF,Semantic Web,Subject Identity,Topic Maps — Patrick Durusau @ 6:55 pm

If I read Halpin and others correctly, URIs identify the subjects they identify, except when they identify some other subject and it isn’t possible to know which of any number of subjects is being identified.

That is what I (and others) take as “ambiguity.”

Some readers have taken my comments on URIs to be critical of RDF, which wasn’t my intent.

What I object to is the sentiment that everyone should use only URIs and then cherry pick any RDF graph that may result for identity purposes.

For example, in a family tree, there may be an entry: John Smith.

For which we can create: http://myfamilytree.smith.com/john_smith

That may resolve to an RDF graph but what properties in that graph identify a particular John Smith?

A “uniform” syntax for that “identifier” isn’t helpful if we all reach various conclusions about what properties in the graph to use for identification.

Or if we have different tests to evaluate the values of those properties.

Even with an RDF graph and rules for which properties to evaluate, we may still have ambiguity.

But rules for evaluation of RDF graphs for identity lessen the ambiguity.

All within the context, format, data model of RDF.

It does detract from URIs as identifiers but URIs as identifiers are no more viable than any single token as an identifier.

Sets of key/value pairs, which are made up of tokens, have the potential to lessen ambiguity, but not banish it.
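
A minimal sketch, with one invented record alongside the John Smith example above, of the difference: a single token either matches or it does not, while a set of key/value pairs gives you properties to evaluate under a stated (and still fallible) rule.

```python
# Identification by single token vs. by a rule over key/value pairs.
record_1 = {"uri": "http://myfamilytree.smith.com/john_smith",
            "name": "John Smith", "born": "1952-03-01", "birthplace": "Dover"}
record_2 = {"uri": "http://otherarchive.example.org/js-1952",   # invented second source
            "name": "J. Smith", "born": "1952-03-01", "birthplace": "Dover"}

def same_by_uri(a, b):
    # A single token either matches or it does not.
    return a["uri"] == b["uri"]

def same_by_properties(a, b):
    # One possible rule: shared birth date and birthplace identify the subject.
    return a["born"] == b["born"] and a["birthplace"] == b["birthplace"]

print(same_by_uri(record_1, record_2))         # False: the tokens differ
print(same_by_properties(record_1, record_2))  # True under this (still fallible) rule
```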

