Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 25, 2010

Consensus of Ambiguity: Theory and Application of Active Learning for Biomedical Image Analysis

Consensus of Ambiguity: Theory and Application of Active Learning for Biomedical Image Analysis Authors: Scott Doyle, Anant Madabhushi Keywords:

Abstract:

Supervised classifiers require manually labeled training samples to classify unlabeled objects. Active Learning (AL) can be used to selectively label only “ambiguous” samples, ensuring that each labeled sample is maximally informative. This is invaluable in applications where manual labeling is expensive, as in medical images where annotation of specific pathologies or anatomical structures is usually only possible by an expert physician. Existing AL methods use a single definition of ambiguity, but there can be significant variation among individual methods. In this paper we present a consensus of ambiguity (CoA) approach to AL, where only samples which are consistently labeled as ambiguous across multiple AL schemes are selected for annotation. CoA-based AL uses fewer samples than Random Learning (RL) while exploiting the variance between individual AL schemes to efficiently label training sets for classifier training. We use a consensus ratio to determine the variance between AL methods, and the CoA approach is used to train classifiers for three different medical image datasets: 100 prostate histopathology images, 18 prostate DCE-MRI patient studies, and 9,000 breast histopathology regions of interest from 2 patients. We use a Probabilistic Boosting Tree (PBT) to classify each dataset as either cancer or non-cancer (prostate), or high or low grade cancer (breast). Training is done using CoA-based AL and evaluated in terms of accuracy and area under the receiver operating characteristic curve (AUC). CoA training yielded between 0.01-0.05% greater performance than RL for the same training set size; approximately 5-10 more samples were required for RL to match the performance of CoA, suggesting that CoA is a more efficient training strategy.

The consensus of ambiguity (CoA) approach is trivially extensible to other kinds of image analysis. Intelligence photos, anyone?

What intrigues me is extension of that approach to other types of data analysis.

Such as having multiple AL schemes process textual data and following the CoA approach to decide what to bounce to experts for annotation.
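
To make the CoA idea concrete, here is a minimal sketch of consensus-based sample selection (my own illustration, not the authors' code). Two stand-in ambiguity scores vote on the most ambiguous samples, and only samples flagged by enough schemes are sent for annotation; the scoring functions, top-k cutoff and consensus ratio are all illustrative assumptions.

    import numpy as np

    def entropy_ambiguity(probs):
        # Ambiguity as prediction entropy (one common AL criterion).
        p = np.clip(probs, 1e-12, 1.0)
        return -(p * np.log(p)).sum(axis=1)

    def margin_ambiguity(probs):
        # Ambiguity as a small gap between the top two class probabilities.
        s = np.sort(probs, axis=1)
        return 1.0 - (s[:, -1] - s[:, -2])

    def consensus_of_ambiguity(probs, schemes, top_k=10, consensus_ratio=1.0):
        # Flag a sample only if at least consensus_ratio of the schemes
        # rank it among their top_k most ambiguous samples.
        votes = np.zeros(probs.shape[0])
        for scheme in schemes:
            flagged = np.argsort(scheme(probs))[-top_k:]
            votes[flagged] += 1
        return np.where(votes >= consensus_ratio * len(schemes))[0]

    # Toy run: class probabilities for 100 unlabeled samples, 2 classes.
    rng = np.random.default_rng(0)
    probs = rng.dirichlet([1.0, 1.0], size=100)
    print(consensus_of_ambiguity(probs, [entropy_ambiguity, margin_ambiguity]))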

Questions:

  1. What types of ambiguity would this approach miss?
  2. How would you apply this method to other data?
  3. How would you measure success/failure of application to other data?
  4. Design and apply this concept to specified data set. (project)

An Evaluation of TS13298 in the Scope of MoReq2

An Evaluation of TS13298 in the Scope of MoReq2. Authors: Gülten Alır, Thomas Sødring and İrem Soydal Keywords: TS13298, MoReq2, electronic records management standards.

Abstract:

TS13298 is the first Turkish standard developed for electronic records management. It was published in 2007 and is particularly important when developing e-government services. MoReq2, which was published in 2008 as an initiative of the European Union countries, is an international “de facto” standard within the field of electronic records management. This paper compares and evaluates the content and presentation of the TS13298 and MoReq2 standards, and similarities and differences between the two standards are described. Moreover, the question of how MoReq2 can be used as a reference when updating TS13298 is also dealt with. The method of hermeneutics is used for the evaluation, and the texts of TS13298 and MoReq2 were compared and reviewed. These texts were evaluated in terms of terminology, access control and security, retention and disposition, capture and declaring, search, retrieval, presentation and metadata scheme. We discovered that TS13298 and MoReq2 have some “requirements” in common. However, the MoReq2 requirements, particularly in terms of control and security, retention and disposition, capture and declaration, search and presentation, are both vaster and more detailed than those of TS13298. As a conclusion it is emphasized that it would be convenient to update TS13298 by considering these requirements. Moreover, it would be useful to update and improve TS13298 by evaluating MoReq2 in terms of terminology and metadata scheme.

This article could form the basis for a topic map of these standards to facilitate convergence of these standards.

It also illustrates how a title search on “electronic records” would miss an article of interest.

1. “Knowledge can only be volunteered it cannot be conscripted.”

Filed under: Knowledge Management,Marketing,Topic Maps — Patrick Durusau @ 6:39 am

Knowledge Management Principle One of Seven (Rendering Knowledge by David Snowden)

Knowledge can only be volunteered it cannot be conscripted. You can’t make someone share their knowledge, because you can never measure if they have. You can measure information transfer or process compliance, but you can’t determine if a senior partner has truly passed on all their experience or knowledge of a case.

To create successful topic maps, there must be incentives for sharing the information that forms the topic map.

Sharing of information should be rewarded, frequently and publicly, over both the short and the long term.

Example of failure to create incentives for sharing information: U.S. Intelligence Community.

Your organization, business, enterprise, government, or government-in-waiting deserves better than that.

Create incentives for sharing information and start building topic maps today!

The Short Comings of Full-Text Searching

The Short Comings of Full-Text Searching by Jeffrey Beall from the University of Colorado Denver.

  1. The synonym problem.
  2. Obsolete terms.
  3. The homonym problem.
  4. Spamming.
  5. Inability to narrow searches by facets.
  6. Inability to sort search results.
  7. The aboutness problem.
  8. Figurative language.
  9. Search words not in web page.
  10. Abstract topics.
  11. Paired topics.
  12. Word lists.
  13. The Dark Web.
  14. Non-textual things.

Questions:

  1. Watch the slide presentation.
  2. Can you give three examples of each shortcoming? (excluding #5 and #6, which strike me as interface issues, not searching issues)
  3. How would you “solve” the word list issue? (Don’t assume quantum computing, etc. There are simpler answers.)
  4. Is metadata the only approach for “non-textual things?” Can you cite 3 papers offering other approaches?

October 24, 2010

Recognizing Synonyms

Filed under: Marketing,Subject Identity,Synonymy — Patrick Durusau @ 11:04 am

I saw a synonym that I recognized the other day and started wondering how I recognized it?

The word I had in mind was “student” and the synonym was “pupil.”

Attempts to recognize synonyms:

  • spelling: student, pupil – No.
  • length: student 7 letters, pupil 5 letters – No.
  • origin: student – late 14c., from O.Fr. estudient; pupil – from O.Fr. pupille (14c.) – No. [1]
  • numerology: student (a = 1, b = 2 …) student = 19 + 20 + 21 + 4 + 5 + 14 + 20 = 103; pupil = 16 + 21 + 16 + 9 + 12 = 74 – No [2]. (These checks are sketched in code below.)
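
The same checks, as throwaway code (nothing here beyond the attempts listed above): none of these surface features separates a synonym pair from a random pair.

    def letter_sum(word):
        # Numerology check: a = 1, b = 2, ... summed over the word.
        return sum(ord(c) - ord('a') + 1 for c in word.lower())

    def surface_features(a, b):
        return {
            "same_spelling": a == b,
            "same_length": len(a) == len(b),
            "shared_letters": sorted(set(a) & set(b)),
            "letter_sums": (letter_sum(a), letter_sum(b)),
        }

    print(surface_features("student", "pupil"))   # nothing here says "synonym"
    print(surface_features("student", "teapot"))  # looks much the same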

But I know “student” and “pupil” to be synonyms.[3]

I could just declare them to be synonyms.

But then how do I answer questions like:

  • Why did I think “student” and “pupil” were synonyms?
  • What would make some other term a synonym of either “student” or “pupil?”
  • How can an automated system match my finding of more synonyms?

Provisional thoughts on answers to follow this week.

Questions:

Without reviewing my answers in this series, pick a pair of synonyms and answer those three questions for that pair. (There are different answers than mine.)

*****

[1] Synonym origins from: Online Etymology Dictionary

[2] There may be some Bible code type operation that can discover synonyms but I am unaware of it.

[3] They are synonyms now, that wasn’t always the case.

The Role of Sparse Data Representation in Semantic Image Understanding

Filed under: Bioinformatics,Biomedical,Image Understanding,Sparse Image Representation — Patrick Durusau @ 10:56 am

The Role of Sparse Data Representation in Semantic Image Understanding Author: Artur Przelaskowski Keywords: Computational intelligence, image understanding, sparse image representation, nonlinear approximation, semantic information theory

Abstract:

This paper discusses a concept of computational understanding of medical images in a context of computer-aided diagnosis. Fundamental research purpose was improved diagnosis of the cases, formulated by human experts. Designed methods of soft computing with extremely important role of: a) semantically sparse data representation, b) determined specific information, formally and experimentally, and c) computational intelligence approach were adjusted to the challenges of image-based diagnosis. Formalized description of image representation procedures was completed with exemplary results of chosen applications, used to explain formulated concepts, to make them more pragmatic and assure diagnostic usefulness. Target pathology was ontologically described, characterized by as stable as possible patterns, numerically described using semantic descriptors in sparse representation. Adjusting of possible source pathology to computational map of target pathology was fundamental issue of considered procedures. Computational understanding means: a) putting together extracted and numerically described content, b) recognition of diagnostic meaning of content objects and their common significance, and c) verification by comparative analysis with all accessible information and knowledge sources (patient record, medical lexicons, the newest communications, reference databases, etc.).

Interesting in its own right for image analysis in the important area of medical imaging but caught my eye for another reason.

Sparse data representation works for understanding images.

Would it work in other semantic domains?

Questions:

  1. What are the minimal clues that enable us to understand a particular text?
  2. Can we learn those clues before we encounter a particular text?
  3. Can we create clues for others to use when encountering a particular text?
  4. How would we identify the text for application of our clues?

Introduction to Biomedical Ontologies

Filed under: Biomedical,Ontology — Patrick Durusau @ 9:58 am

A very good introduction to ontologies: Introduction to Biomedical Ontologies.

This introduction neatly frames the issue addressed by both controlled vocabularies (ontologies) and topic maps.

When faced with multiple terms for a single subject, a controlled vocabulary (ontology) solves the problem by using a single term.

Other terms that name the same subject are treated as “near synonyms.”
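
A toy illustration of that move, with made-up terms: at indexing time every near synonym collapses to the one controlled term.

    # Hypothetical synonym ring: near synonyms collapse to a single preferred term.
    PREFERRED = {
        "myocardial infarction": "myocardial infarction",  # the controlled term
        "heart attack": "myocardial infarction",           # near synonym
        "MI": "myocardial infarction",                     # near synonym
    }

    def index_term(term):
        # Return the single term actually used for indexing and retrieval.
        return PREFERRED.get(term.strip(), term)

    print(index_term("heart attack"))   # -> myocardial infarction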

Watch the video and then check back here for a post called: Near Synonyms.

I will discuss how the treatment of “near synonyms” differs between topic maps and controlled vocabularies (ontologies).

Indexing Nature: Carl Linnaeus (1707-1778) and His Fact-Gathering Strategies

Filed under: Indexing,Information Retrieval,Interface Research/Design,Ontology — Patrick Durusau @ 9:52 am

Indexing Nature: Carl Linnaeus (1707-1778) and His Fact-Gathering Strategies Authors: Staffan Müller-Wille & Sara Scharf (Working Papers on The Nature of Evidence: How Well Do ‘Facts’ Travel? No. 36/08)

Interesting article that traces the strategies used by Linnaeus when confronted with the “first bio-information crisis” as the authors term it.

Questions:

  1. In what ways do ontologies resemble the bound library catalogs of the early 18th century?
  2. Do computers make ontologies any less like those bound library catalogs?
  3. Short report (3-5 pages, with citations) on transition of libraries from bound catalogs to index cards.
  4. Linnaeus’s colleagues weren’t idle. What other strategies, successful or otherwise, were in use? (project)

October 23, 2010

CASPAR (Cultural, Artistic, and Scientific Knowledge for Preservation, Access and Retrieval)

Filed under: Cataloging,Digital Library,Information Retrieval,Preservation — Patrick Durusau @ 7:58 am

CASPAR (Cultural, Artistic, and Scientific Knowledge for Preservation, Access and Retrieval).

From the website:

CASPAR methodological and technological solution:

  • is compliant to the OAIS Reference Model – the main standard of reference in digital preservation
  • is technology-neutral: the preservation environment could be implemented using any kind of emerging technology
  • adopts a distributed, asynchronous, loosely coupled architecture and each key component is self-contained and portable: it may be deployed without dependencies on different platform and framework
  • is domain independent: it could be applied with low additional effort to multiple domains/contexts.
  • preserves knowledge and intelligibility, not just the “bits”
  • guarantees the integrity and identity of the information preserved as well as the protection of digital rights

FYI: OAIS Reference Model

As a librarian, you will be confronted with claims similar to these in vendor literature, grant applications and other marketing materials.

Questions:

  1. Pick one of these claims. What documentation/software produced by the project would you review to evaluate the claim you have chosen?
  2. What other materials do you think would be relevant to your review?
  3. Perform the actual review (10 – 15 pages, with citations, project)

SLiMSearch: A Webserver for Finding Novel Occurrences of Short Linear Motifs in Proteins, Incorporating Sequence Context

Filed under: Bioinformatics,Biomedical,Pattern Recognition,Subject Identity — Patrick Durusau @ 5:56 am

SLiMSearch: A Webserver for Finding Novel Occurrences of Short Linear Motifs in Proteins, Incorporating Sequence Context Authors: Norman E. Davey, Niall J. Haslam, Denis C. Shields, Richard J. Edwards Keywords: short linear motif, motif discovery, minimotif, elm

Short, linear motifs (SLiMs) play a critical role in many biological processes. The SLiMSearch (Short, Linear Motif Search) webserver is a flexible tool that enables researchers to identify novel occurrences of pre-defined SLiMs in sets of proteins. Numerous masking options give the user great control over the contextual information to be included in the analyses, including evolutionary filtering and protein structural disorder. User-friendly output and visualizations of motif context allow the user to quickly gain insight into the validity of a putatively functional motif occurrence. Users can search motifs against the human proteome, or submit their own datasets of UniProt proteins, in which case motif support within the dataset is statistically assessed for over- and under-representation, accounting for evolutionary relationships between input proteins. SLiMSearch is freely available as open source Python modules and all webserver results are available for download. The SLiMSearch server is available at: http://bioware.ucd.ie/slimsearch.html .
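
SLiMs are usually written as regular-expression-like patterns, so the bare task of “find occurrences of a pre-defined motif in a set of proteins” can be illustrated with a regex scan. The pattern and sequences below are made up, and this is not the SLiMSearch code, which adds masking, structural disorder and evolutionary filtering on top.

    import re

    # Illustrative motif pattern (regex form) and toy protein sequences.
    MOTIF = re.compile(r"P.[ST]P")
    proteins = {
        "protA": "MKTAYIAKQRPQSPVDLL",
        "protB": "MSSPNTPPKQRLL",
    }

    for name, seq in proteins.items():
        for m in MOTIF.finditer(seq):
            print(name, m.group(), "at position", m.start() + 1)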

Software: http://bioware.ucd.ie/slimsearch.html

Seemed like an appropriate resource to follow on today’s earlier posting.

Note in the keywords, “elm.”

Care to guess what that means? If you are a bioinformatics or biology person you may get it correct.

What do you think the odds are that any person much less a general search engine will get it correct?

Topic maps are about making sure you find: Eukaryotic Linear Motif Resource without wading through what a search of any common search engine returns for “elm.”

Questions:

  1. What other terms in this paper represent other subjects?
  2. What properties would you use to identify those subjects?
  3. How would you communicate those subjects to someone else?

An Algorithm to Find All Identical Motifs in Multiple Biological Sequences

Filed under: Bioinformatics,Biomedical,Pattern Recognition,Subject Identity — Patrick Durusau @ 5:25 am

An Algorithm to Find All Identical Motifs in Multiple Biological Sequences Authors: Ashish Kishor Bindal, R. Sabarinathan, J. Sridhar, D. Sherlin, K. Sekar Keywords: Sequence motifs, nucleotide and protein sequences, identical motifs, dynamic programming, direct repeat and phylogenetic relationships

Sequence motifs are of greater biological importance in nucleotide and protein sequences. The conserved occurrence of identical motifs represents the functional significance and helps to classify the biological sequences. In this paper, a new algorithm is proposed to find all identical motifs in multiple nucleotide or protein sequences. The proposed algorithm uses the concept of dynamic programming. The application of this algorithm includes the identification of (a) conserved identical sequence motifs and (b) identical or direct repeat sequence motifs across multiple biological sequences (nucleotide or protein sequences). Further, the proposed algorithm facilitates the analysis of comparative internal sequence repeats for the evolutionary studies which helps to derive the phylogenetic relationships from the distribution of repeats.
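
The authors use dynamic programming; purely to make the task concrete, here is a much cruder sketch that reports fixed-length substrings common to every input sequence (the sequences and length are invented).

    def kmers(seq, k):
        # All substrings of length k in one sequence.
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def identical_motifs(sequences, k):
        # Length-k substrings present in every sequence: a crude stand-in for
        # the paper's dynamic-programming search over all motif lengths.
        common = kmers(sequences[0], k)
        for seq in sequences[1:]:
            common &= kmers(seq, k)
        return sorted(common)

    seqs = ["ACGTGACCTGA", "TTGACCTGAAC", "GACCTGATTAC"]  # toy nucleotide sequences
    print(identical_motifs(seqs, k=6))                    # ['ACCTGA', 'GACCTG']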

Good illustration that subject identification, here sequence motifs in nucleotide and protein sequences, varies by domain.

Subject matching in this type of data on the basis of assigned URL identifiers for sequence motifs would be silly.

But that’s the question isn’t it? What is the appropriate basis for subject matching in a particular domain?

Questions:

  1. Identify and describe one (1) domain where URL matching for subjects would be unnecessary overhead. (3 pages, no citations)
  2. Identify and describe one (1) domain where URL matching for subjects would be useful. (3 pages, no citations)
  3. What are the advantages of URLs as a lingua franca? (3 pages, no citations)
  4. What are the disadvantages of URLs as a lingua franca? (3 pages, no citations)

***
BTW, when you see “no citations” that does not mean you should not be reading the relevant literature. What it means is that I want your analysis of the issues and not your channeling of the latest literature.

October 22, 2010

Rethinking Library Linking: Breathing New Life into OpenURL

Filed under: Cataloging,Indexing,OpenURL,Subject Identity,Topic Maps — Patrick Durusau @ 7:26 am

Rethinking Library Linking: Breathing New Life into OpenURL Authors: Cindi Trainor and Jason Price

Abstract:

OpenURL was devised to solve the “appropriate copy problem.” As online content proliferated, it became possible for libraries to obtain the same content from multiple locales: directly from publishers and subscription agents; indirectly through licensing citation databases that contain full text; and, increasingly, from free online sources. Before the advent of OpenURL, the only way to know whether a journal was held by the library was to search multiple resources. An OpenURL link resolver accepts links from library citation databases (sources) and returns to the user a menu of choices (targets) that may include links to full text, the library catalog, and other related services (figure 1). Key to understanding OpenURL is the concept of “context sensitive” linking: links to the same item will be different for users of different libraries, and are dependent on the library’s collections. This issue of Library Technology Reports provides practicing librarians with real-world examples and strategies for improving resolver usability and functionality in their own institutions.

Resources:

OpenURL (ANSI/NISO Z39.88-2004)

openURL@oclc.org archives
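
For anyone who has not looked inside one, an OpenURL is just a URL carrying citation metadata in key/value form to the library's resolver, which answers differently depending on local holdings. A rough sketch of building one follows; the resolver address is a placeholder and the volume/issue values are illustrative, so check the key names against the Z39.88-2004 standard before relying on them.

    from urllib.parse import urlencode

    # Placeholder resolver address; every library runs its own.
    RESOLVER = "https://resolver.example.edu/openurl"

    citation = {
        "url_ver": "Z39.88-2004",
        "ctx_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",  # journal article metadata format
        "rft.jtitle": "Library Technology Reports",
        "rft.atitle": "Rethinking Library Linking",
        "rft.volume": "46",    # illustrative value
        "rft.issue": "7",      # illustrative value
    }

    # The same citation resolves to different targets at different libraries,
    # because the resolver checks local holdings before answering.
    print(RESOLVER + "?" + urlencode(citation))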

Questions:

  1. OCLC says of OpenURL

    Remember the card catalog? Everything in a library was represented in the card catalog with one or more cards carrying bibliographic information. OpenURL is the internet equivalent of those index cards.

  2. True? 3-5 pages, no citations, or
  3. False? 3-5 pages, no citations.

Neo4j 1.2 Milestone 2 – Release

Filed under: Graphs,Indexing,Neo4j,Software — Patrick Durusau @ 6:02 am

Neo4j 1.2 Milestone 2 has been released!

Relevant to topic maps in general and TMQL in particular, are the improvement to indexing and querying capabilities.

Neo4j uses Lucene as a back-end.

Would Neo4j be a good way to prototype proposals for TMQL?

To evaluate concerns about implementation difficulties.

And quite possibly to encourage the non-invention of new query syntaxes.

A side effect would be demonstrating that Neo4j could be used as a topic map platform.
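
Not Neo4j's API, just the shape of the mapping: a sketch of how topics, associations and occurrences might sit in a property graph, which is roughly what a Neo4j-backed prototype would store and what a TMQL evaluator would walk. The names and values are invented.

    # Property-graph sketch in plain Python (not the Neo4j API): topics become
    # nodes, associations become typed edges, occurrences become linked resources.
    nodes = {
        "t1": {"kind": "topic", "name": "Puccini"},
        "t2": {"kind": "topic", "name": "Tosca"},
        "o1": {"kind": "occurrence", "type": "article", "ref": "http://example.org/tosca"},
    }
    edges = [
        ("t1", "composed", "t2"),        # association, typed by its association type
        ("t2", "has_occurrence", "o1"),  # occurrence attached to a topic
    ]

    # A query in the spirit of TMQL: what did Puccini compose?
    print([nodes[dst]["name"] for src, rel, dst in edges
           if rel == "composed" and nodes[src]["name"] == "Puccini"])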

National Center for Biomedical Ontology

Filed under: Biomedical,Health care,Ontology — Patrick Durusau @ 6:00 am

National Center for Biomedical Ontology

I feel like a kid in a candy store at this site.

I suppose it is being an academic researcher at heart.

Reports on specific resources to follow.

Linking Enterprise Data

Filed under: Knowledge Management,Linked Data,Semantic Web — Patrick Durusau @ 5:53 am

Linking Enterprise Data, ed. by David Wood. The full text is available in HTML.

Table of Contents:

  • Part I Why Link Enterprise Data?
    • Semantic Web and the Linked Data Enterprise, Dean Allemang
    • The Role of Community-Driven Data Curation for Enterprises, Edward Curry, Andre Freitas, and Sean O’Riain
  • Part II Approval and Support of Linked Data Projects
    • Preparing for a Linked Data Enterprise, Bernadette Hyland
    • Selling and Building Linked Data: Drive Value and Gain Momentum, Kristen Harris
  • Part III Techniques for Linking Enterprise Data
    • Enhancing Enterprise 2.0 Ecosystems Using Semantic Web and Linked Data Technologies: The SemSLATES Approach, Alexandre Passant, Philippe Laublet, John G. Breslin and Stefan Decker
    • Linking XBRL Financial Data, Roberto García and Rosa Gil
    • Scalable Reasoning Techniques for Semantic Enterprise Data, Reza B’Far
    • Reliable and Persistent Identification of Linked Data Elements, David Wood

Comments to follow.

October 21, 2010

mloss.org – machine learning open source software

mloss.org – machine learning open source software

Open source repository of machine learning software.

Not only are subjects being recognized by these software packages but their processes and choices are subjects as well. Not to mention their description in the literature.

Fruitful grounds for adaptation to topic maps as well as being the subject of topic maps.

There are literally hundreds of software packages here so I welcome suggestions, comments, etc. on any and all of them.

Questions:

  1. Examples of vocabulary mis-match in machine learning literature?
  2. Using one sample data set, how would you integrate results from different packages? Assume you are not merging classifiers.
  3. What if the classifiers are unknown? That is, all you have are the final results. Is your result different? Reliable?
  4. Describe a (singular) merging of classifiers in subject identity terms.

A Survey of Genetics-based Machine Learning

Filed under: Evolutionary,Learning Classifier,Machine Learning,Neural Networks — Patrick Durusau @ 5:15 am

A Survey of Genetics-based Machine Learning Author: Tim Kovacs

Abstract:

This is a survey of the field of Genetics-based Machine Learning (GBML): the application of evolutionary algorithms to machine learning. We assume readers are familiar with evolutionary algorithms and their application to optimisation problems, but not necessarily with machine learning. We briefly outline the scope of machine learning, introduce the more specific area of supervised learning, contrast it with optimisation and present arguments for and against GBML. Next we introduce a framework for GBML which includes ways of classifying GBML algorithms and a discussion of the interaction between learning and evolution. We then review the following areas with emphasis on their evolutionary aspects: GBML for sub-problems of learning, genetic programming, evolving ensembles, evolving neural networks, learning classifier systems, and genetic fuzzy systems.

The author’s preprint has 322 references. Plus there are slides, bibliographies in BibTeX.

If you are interested in augmented topic map authoring using GBML, this would be a good starting place.

Questions:

  1. Pick 3 subject areas. What arguments would you make in favor of GBML for augmenting authoring of a topic map for those subject areas?
  2. Same subject areas, but what arguments would you make against the use of GBML for augmenting authoring of a topic map for those subject areas?
  3. Design an experiment to test one of your arguments for and against GBML. (project, use of the literature encouraged)
  4. Convert the BibTeX formatted bibliographies into a topic map. (project)

Shogun – A Large Scale Machine Learning Toolbox

Filed under: Bioinformatics,Cheminformatics,Kernel Methods,Pattern Recognition — Patrick Durusau @ 5:08 am

Shogun – A Large Scale Machine Learning Toolbox

Not for the faint of heart but an excellent resource for those interested in large scale kernel methods.

Offers several Support Vector Machine (SVM) implementations and implementations of the latest kernels. Has interfaces to Matlab(tm), R, Octave and Python.

Questions:

  1. Pick any one of the methods. How would you integrate it into augmented authoring for a topic map?
  2. What aspect(s) of this site would you change using topic maps?
  3. What augmented authoring techniques would help you apply topic maps to this site?
  4. Apply topic maps to this site. (project)

Research: What is the Interaction Cost in Information Visualization?

Research: What is the Interaction Cost in Information Visualization? by Enrico Bertini, came to us via Sam Hunting.

A summary of Heidi Lam’s A Framework of Interaction Costs in Information Visualization, but both will repay the time spent reading/studying them.

However intuitive it may seem to its designers, no “semantic” interface is any better than it is perceived to be by its users.

Questions:

  1. After reading Lam’s article, evaluate two interfaces, one familiar to you and one you encounter as a first-time user.
  2. Using Lam’s framework, how do you evaluate the interfaces?
  3. What aspects of those interfaces would you most like to test with users?
  4. Design a test for two aspects of one of your interfaces. (project*)
  5. Care to update Lam’s listing of papers listing interactivity issues? (project)

* Warning: Test design is partially an art, partially a science and partially stumbling around in semantic darkness. Just so you are aware that done properly, this project will require extra work.

October 20, 2010

GPM: A Graph Pattern Matching Kernel with Diffusion for Chemical Compound Classification

GPM: A Graph Pattern Matching Kernel with Diffusion for Chemical Compound Classification
Authors: Aaron Smalter, Jun Huan and Gerald Lushington

Abstract:

Classifying chemical compounds is an active topic in drug design and other cheminformatics applications. Graphs are general tools for organizing information from heterogeneous sources and have been applied in modeling many kinds of biological data. With the fast accumulation of chemical structure data, building highly accurate predictive models for chemical graphs emerges as a new challenge.

In this paper, we demonstrate a novel technique called Graph Pattern Matching kernel (GPM). Our idea is to leverage existing frequent pattern discovery methods and explore their application to kernel classifiers (e.g. support vector machine) for graph classification. In our method, we first identify all frequent patterns from a graph database. We then map subgraphs to graphs in the database and use a diffusion process to label nodes in the graphs. Finally the kernel is computed using a set matching algorithm. We performed experiments on 16 chemical structure data sets and have compared our methods to other major graph kernels. The experimental results demonstrate excellent performance of our method.

The authors also note:

Publicly-available large-scale chemical compound databases have offered tremendous opportunities for creating highly efficient in silico drug design methods. Many machine learning and data mining algorithms have been applied to study the structure-activity relationship of chemicals with the goal of building classifiers for graph-structured data.

In other words, with a desktop machine, public data and a little imagination, you can make a fundamental contribution to drug design methods. (FWIW, the pharmaceutical companies are making money hand over fist.)

Integrating your contribution or its results into existing information, such as with topic maps, will only increase its value.

Integrating Biological Data – Not A URL In Sight!

Actual title: Kernel methods for integrating biological data by Dick de Ridder, The Delft Bioinformatics Lab, Delft University of Technology.

Biological data integration to improve protein expression – read hugely profitable industrial processes based on biology.

Need to integrate biological data, including “prior knowledge.”

In case kernel methods aren’t your “thing,” one important point:

There are vast seas of economically important data unsullied by URLs.

Kernel methods are one method to integrate some of that data.
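
The basic integration move in multiple-kernel methods is easy to state even if the details are not: build one kernel (similarity) matrix per data source and combine them, for example as a weighted sum fed to any kernel classifier. The sketch below uses made-up data and assumed equal weights.

    import numpy as np

    def rbf_kernel(X, gamma=1.0):
        # Gaussian (RBF) kernel matrix for one data source.
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    rng = np.random.default_rng(1)
    expression = rng.normal(size=(5, 20))      # toy expression features for 5 proteins
    sequence_feats = rng.normal(size=(5, 8))   # toy sequence-derived features, same proteins

    # One kernel per source, combined with (assumed) equal weights.
    K = 0.5 * rbf_kernel(expression) + 0.5 * rbf_kernel(sequence_feats)
    print(K.shape)   # (5, 5) similarity matrix usable by an SVM or similar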

Questions:

  1. How to integrate kernel methods into topic maps? (research project)
  2. Subjects in a kernel method? (research paper, limit to one method)
  3. Modeling specific uses of kernels in topic maps. (research project)
  4. Edges of kernels? Are there subject limits to kernels? (research project)

Variations/FRBR: Variations as a Testbed for the FRBR Conceptual Model

Filed under: Dataset,FRBR,Search Interface,Searching — Patrick Durusau @ 3:18 am

FRBRized data in XML for free download!

Approximately 80,000 bibliographic records for musical recordings and 105,000 or so for scores.

Be sure to take a look at the search interface and submit suggestions.

From the post:

The Variations/FRBR [1] project at Indiana University has released bulk downloads of metadata for the sound recordings presented in our Scherzo [2] music discovery system in a FRBRized XML format. The downloadable data includes FRBR Work, Expression, Manifestation, Person, and Corporate Body records, along with the structural and responsibility relationships connecting them. While this is still an incomplete representation of FRBR and FRAD, we hope that the release of this data will aid others that are studying or working with FRBR. This XML data conforms to the “efrbr” set of XML Schemas [3] created for this project.

The XML data may be downloaded from http://vfrbr.info/data/1.0/index.shtml, and comments/questions may be directed to vfrbr@dlib.indiana.edu.

One caveat to those who seek to use this data: we plan to continue improving our FRBRization algorithm into the future and have not yet implemented a way to keep entity identifiers consistent between new data loads. Therefore we cannot at this time guarantee the Work with the identifier http://vfrbr.info/work/30001, for example, will have the same identifier in the future. Therefore this data at this time should be considered highly experimental.

Many thanks to the Institute of Museum and Library Services for funding the V/FRBR project.

Also, if you’re interested in FRBR, please do check out our experimental discovery system, Scherzo [2]. We’re very interested in your feedback!

Jenn

[1] V/FRBR project home page (http://vfrbr.info); FRBR report
(http://www.ifla.org/en/publications/functional-requirements-for-bibliographic-records)

[2] Scherzo (http://vfrbr.info/search)

[3] V/FRBR project XML Schemas (http://vfrbr.info/schemas/1.0/index.shtml)

Information shamelessly stolen from Last Week in FRBR #33.

8th Extended Semantic Web Conference: May 29 – June 2 2011 Heraklion, Greece

Filed under: Conferences,Ontology,OWL,Semantic Web,Semantics,SPARQL — Patrick Durusau @ 3:15 am

8th Extended Semantic Web Conference: May 29 – June 2 2011 Heraklion, Greece

Important Dates

See ESWC 2010 for range of content.

October 19, 2010

Guessing Explanations?

Filed under: Interface Research/Design,Marketing,Topic Maps,Uncategorized — Patrick Durusau @ 9:44 am

Our apparent inability to imagine other audiences keeps nagging at me.

If that is true for hierarchical arrangements, then it must be true for indexes as well.

So far, standard topic maps sort of thinking.

What if that applies to explanations as well?

That is, I create better explanations when I imagine the audience to be like me.

And don’t try to guess what others will find to be a good explanation?

Why not test explanations with audiences?

Make explanation, even of topic maps, a matter of empirical investigation rather than formal correctness.

Enhancing Graph Database Indexing by Suffix Tree Structure

Filed under: Graphs,Indexing,Suffix Tree — Patrick Durusau @ 8:02 am

Enhancing Graph Database Indexing by Suffix Tree Structure
Authors: Vincenzo Bonnici, Alfredo Ferro, Rosalba Giugno, Alfredo Pulvirenti, Dennis Shasha Keywords: subgraph isomorphism, graph database search, indexing, suffix tree, molecular database

Abstract:

Biomedical and chemical databases are large and rapidly growing in size. Graphs naturally model such kinds of data. To fully exploit the wealth of information in these graph databases, scientists require systems that search for all occurrences of a query graph. To deal efficiently with graph searching, advanced methods for indexing, representation and matching of graphs have been proposed.

This paper presents GraphGrepSX. The system implements efficient graph searching algorithms together with an advanced filtering technique. GraphGrepSX is compared with SING, GraphFind, CTree and GCoding. Experiments show that GraphGrepSX outperforms the compared systems on a very large collection of molecular data. In particular, it reduces the size and the time for the construction of large database index and outperforms the most popular systems. (hyperlinks added.)

Be aware that bioinformatics is at the cutting edge of search/retrieval technology. Pick up any proceedings volume for the last year to see what I mean.

A credible topic map is going to incorporate one or more of the techniques you will find there, plus semantic mapping based on those techniques.

Saying Topic-Association-Occurrence is only going to get you past the first two minutes of your presentation. You will need something familiar (to your audience) and domain specific to fill the rest of your time.

BTW, see the audience posting earlier today. Don’t guess what will interest your audience. Ask someone in that community what interests them.

Fast Secure Computation of Set Intersection

Filed under: Security,Set Intersection,Sets — Patrick Durusau @ 6:21 am

Fast Secure Computation of Set Intersection Authors: Stanisław Jarecki and Xiaomin Liu

Introduction:

Secure Protocol for Computing Set Intersection and Extensions. Secure computation of set intersection (or secure evaluation of a set intersection function) is a protocol which allows two parties, sender S and receiver R, to interact on their respective input sets X and Y in such a way that R learns X ∩ Y and S learns nothing. Secure computation of set intersection has numerous useful applications: For example, medical institutions could find common patients without learning any information about patients that are not in the intersection, different security agencies could search for common items in their databases without revealing any other information, the U.S. Department of Homeland Security can quickly find if there is a match between a passenger manifest and its terrorist watch list, etc.

Imagine partial sharing of a topic map in a secure environment.
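
Just to pin down what the protocol computes (and emphatically not how it computes it securely), the insecure reference version of the functionality is a one-liner; the paper's contribution is getting the same output to R without either party revealing anything beyond the intersection. The identifiers below are toy values.

    # Insecure reference computation of the functionality only: R learns X ∩ Y.
    # The whole point of the paper is reaching this result *without* the parties
    # exposing their full sets, which this plainly does not do.
    sender_set = {"subject:alpha", "subject:beta", "subject:gamma"}   # S's topic identifiers (toy)
    receiver_set = {"subject:beta", "subject:delta"}                  # R's topic identifiers (toy)

    print(sender_set & receiver_set)   # {'subject:beta'} -- all R should ever learn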

The article has a useful review of work in this area.

Curious if this really prevents learning of additional information.

If the source is treated as a black box and subjects are projected on the basis of responses to different receivers, with mapping between those,…, well, that had better wait for a future post. (Or a contract from someone interested in breaching a secure system. 😉 )

The effect of audience design on labeling, organizing, and finding shared files (unexpected result – see below)

The effect of audience design on labeling, organizing, and finding shared files Authors: Emilee Rader Keywords: audience design, common ground, file labeling and organizing, group information management

Abstract:

In an online experiment, I apply theory from psychology and communications to find out whether group information management tasks are governed by the same communication processes as conversation. This paper describes results that replicate previous research, and expand our knowledge about audience design and packaging for future reuse when communication is mediated by a co-constructed artifact like a file-and-folder hierarchy. Results indicate that it is easier for information consumers to search for files in hierarchies created by information producers who imagine their intended audience to be someone similar to them, independent of whether the producer and consumer actually share common ground. This research helps us better understand packaging choices made by information producers, and the direct implications of those choices for other users of group information systems.

Examples from the paper:

  • A scientist needs to locate procedures and results from an experiment conducted by another researcher in his lab.
  • A student learning the open-source, command-line statistical computing environment R needs to find out how to calculate the mode of her dataset.
  • A new member of a design team needs to review requirements analysis activities that took place before he joined the team.
  • An intelligence analyst needs to consult information collected by other agencies to assess a potential threat.

Do any of those sound familiar?

Unexpected result:

In general, Consumers performed best (fewest clicks to find the target file) when the Producer created a hierarchy for an Imagined Audience from the same community, regardless of the community the Consumer came from. Consumers had the most difficulty when searching in hierarchies created by a Producer for a dissimilar Imagined Audience.

In other words, imagining an audience unlike yourself is a bad strategy. Create a hierarchy that works for you. (And with a topic map you could let others create hierarchies that work for them.)

(Apologies for the length of this post but unexpected interface results merit the space.)

October 18, 2010

Vassallo, FRAD – ISAAR(CPF) – EAC-CPF – Topic Maps Mapping – Post

Filed under: Authority Record,Topic Maps — Patrick Durusau @ 8:59 am

Vassallo, FRAD – ISAAR(CPF) – EAC-CPF – Topic Maps Mapping

Description:

A corollary subject of my PhD thesis was the analysis of parallels between the librarian and archival worlds.

One of the points of contact is in the description of agents and in the constitution of authority files.

For those aims, a mapping between FRAD (Functional Requirements for Authority Data) and ISAAR(CPF) (International Standard Archival Authority Record for Corporate Bodies, persons and families) could be useful.

Sample mappings are available for download.

A nice intersection of library and topic map issues.

TMDM-NG – Reification

Filed under: Authoring Topic Maps,TMDM,Topic Maps,XTM — Patrick Durusau @ 7:44 am

Reification in the TMDM means using a topic to “reify” a name, occurrence, association, etc. Wherever a subject is represented by a name, occurrence or association, after “reification” it is also represented by a topic.

For the TMDM-NG, let’s drop reification and make names, occurrences, associations, etc., first class citizens in a topic map.

Making names, occurrences, associations first class citizens would mean we could add properties to them without the overhead of creating topics to represent subjects that already have representatives in a topic map.
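
A rough data-structure sketch of the difference, using invented dictionaries rather than any TMDM API: under reification, saying something about a name means minting a separate topic for it; with first-class names the properties sit on the name itself.

    # TMDM-style reification (sketch): to say something *about* a name, create a
    # topic that reifies it and hang the extra properties there.
    name = {"id": "n1", "value": "Tosca", "parent_topic": "t2"}
    reifying_topic = {"id": "t-n1", "reifies": "n1",
                      "properties": {"source": "libretto", "language": "it"}}

    # First-class alternative (sketch): the name carries its own properties,
    # no extra topic required.
    name_first_class = {"id": "n1", "value": "Tosca", "parent_topic": "t2",
                        "properties": {"source": "libretto", "language": "it"}}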

We do need to work on “occurrence” being overloaded to mean both an occurrence in the bibliographic sense and a property, but that can wait for a future post.

The X Factor of Information Systems

Filed under: Information Retrieval,Natural Language Processing,Semantics — Patrick Durusau @ 5:02 am

David Segal’s “The X Factor of Economics,” NYT, Sunday, October 17, 2010, Week in Review, concludes that standard economic models don’t account for one critical factor.

Economics can be dressed up in mathematical garb, with after the fact precision, but the X factor causes it to lack before the fact precision. Precision? Seems like an inadequate term for a profession that can’t agree on what has happened, is in fact happening, much less what is about to happen.

But in any event, the X factor? That would be us, people.

People who gleefully buy, save, work, rest and generally live our lives without any regard for theories of economic behavior.

The same people who live without any regard for theories of semantics.

People are the X factor in information systems.

Just a caution to take into account when evaluating information, metadata or semantic systems.
