Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

December 4, 2010

Exploring Homology Using the Concept of Three-State Entropy Vector

Filed under: Bioinformatics,Biomedical,Data Mining — Patrick Durusau @ 3:24 pm

Exploring Homology Using the Concept of Three-State Entropy Vector
Authors: Armando J. Pinho, Sara P. Garcia, Paulo J. S. G. Ferreira, Vera Afreixo, Carlos A. C. Bastos, António J. R. Neves, João M. O. S. Rodrigues
Keywords: DNA signature, DNA coding regions, DNA entropy, Markov models

Abstract:

The three-base periodicity usually found in exons has been used for several purposes, as for example the prediction of potential genes. In this paper, we use a data model, previously proposed for encoding protein-coding regions of DNA sequences, to build signatures capable of supporting the construction of meaningful dendograms. The model relies on the three-base periodicity and provides an estimate of the entropy associated with each of the three bases of the codons. We observe that the three entropy values vary among themselves and also from species to species. Moreover, we provide evidence that this makes it possible to associate a three-state entropy vector with each species, and we show that similar species are characterized by similar three-state entropy vectors.

I include this paper both as informative for the bioinformatics crowd and as an illustration that subject identity tests are as varied as the subjects they identify. In this particular case, species are identified for the construction of dendrograms.
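For readers who want to see the core idea in code, here is a minimal sketch of a per-codon-position entropy vector. It uses simple zero-order counts rather than the Markov-model estimates the paper relies on, and the sequence is invented:

```python
# Minimal sketch: zero-order Shannon entropy (bits) of the base
# distribution at each of the three codon positions. The paper uses
# Markov-model-based estimates; this shows only the flavor of the idea.
from collections import Counter
from math import log2

def three_state_entropy(cds: str) -> tuple:
    """Entropy of bases at codon positions 0, 1 and 2."""
    entropies = []
    for pos in range(3):
        counts = Counter(cds[pos::3])
        total = sum(counts.values())
        entropies.append(-sum((n / total) * log2(n / total)
                              for n in counts.values()))
    return tuple(entropies)

print(three_state_entropy("ATGGCGATTGCAATGGCCGTA"))
```

Similar species should then sit close together when these three-value vectors are compared.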

November 25, 2010

The Genomics and Bioinformatics Group

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 7:16 am

The Genomics and Bioinformatics Group

From the website:

The GBG’s mission is to manage and assess molecular interaction data obtained through multiple platforms, increase the understanding of the effect of those interactions on the chemosensitivity of cancer, and create tools that will facilitate that process. Translation of that information will be directed towards the recognition of diagnostic and therapeutic cancer biomarkers, and directed cancer therapy.

If you are interested in bioinformatics and the type of tools currently in use, this is a good place to start.

Questions:

  1. Choose one of the tools. What subject identity test(s) are implicit in the tool? (3-5 pages, no citations)
  2. Can the results or data for the tool be easily mapped to professional literature? Why/Why not? (3-5 pages, no citations)
  3. How does the identity of results differ from the identity of data, if it does? (3-5 pages, no citations)

November 15, 2010

Towards Index-based Similarity Search for Protein Structure Databases

Filed under: Bioinformatics,Biomedical,Indexing,Similarity — Patrick Durusau @ 5:00 am

Towards Index-based Similarity Search for Protein Structure Databases
Authors: Orhan Çamoǧlu, Tamer Kahveci, Ambuj K. Singh
Keywords: Protein structures, feature vectors, indexing, dataset join

Abstract:

We propose two methods for finding similarities in protein structure databases. Our techniques extract feature vectors on triplets of SSEs (Secondary Structure Elements) of proteins. These feature vectors are then indexed using a multidimensional index structure. Our first technique considers the problem of finding proteins similar to a given query protein in a protein dataset. This technique quickly finds promising proteins using the index structure. These proteins are then aligned to the query protein using a popular pairwise alignment tool such as VAST. We also develop a novel statistical model to estimate the goodness of a match using the SSEs. Our second technique considers the problem of joining two protein datasets to find an all-to-all similarity. Experimental results show that our techniques improve the pruning time of VAST 3 to 3.5 times while keeping the sensitivity similar.

Unless you want to do a project on bioinformatics indexing and topic maps, this paper probably isn’t of much interest.

I include it as an illustration of fashioning a domain-specific index and, for those who are interested, of what subjects and their definitions lurk therein.
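For the curious, a minimal sketch of the filter-then-align pattern the paper describes: random vectors stand in for the SSE-triplet features, and a multidimensional index (a KD-tree) shortlists candidates before any expensive alignment. This is not the authors' implementation:

```python
# Sketch of index-based candidate pruning: random vectors stand in for
# SSE-triplet features; a KD-tree shortlists neighbors that would then
# go to a pairwise alignment tool such as VAST.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
db_features = rng.random((1000, 6))      # 1000 proteins x 6-d features
index = cKDTree(db_features)

query = rng.random(6)                    # features of the query protein
dists, ids = index.query(query, k=5)     # 5 most promising candidates
print(ids, dists)
```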

Questions (for those who want to pursue both topic maps and bioinformatics):

  1. Isolate all the “we chose” aspects of the paper. What results would have been different with other choices? The “we obtained best results…” is unsatisfying. In what sense “best results?”
  2. What aspects of this process would be amenable to use of a topic map?
  3. What about the results (if anything) would have to be different to make these results meaningful in a topic map to be merged with results by other researchers?

November 7, 2010

2nd International Conference on Computational & Mathematical Biomedical Engineering – Conference – 2011

Filed under: Biomedical,Conferences — Patrick Durusau @ 8:25 pm

2nd International Conference on Computational & Mathematical Biomedical Engineering

30th March – 1st April 2011

George Mason University, Washington D.C., USA

Abstract/Expression of interest: 15 November 2010
(see site for other details)

Subjects abound: imaging, analysis, and management of data.

Are you ahead of the curve?

November 4, 2010

The Complexity and Application of Syntactic Pattern Recognition Using Finite Inductive Strings

Filed under: Bioinformatics,Biomedical,Pattern Recognition — Patrick Durusau @ 12:26 pm

The Complexity and Application of Syntactic Pattern Recognition Using Finite Inductive Strings
Authors: Elijah Myers, Paul S. Fisher, Keith Irwin, Jinsuk Baek, Joao Setubal
Keywords: Pattern Recognition, finite induction, syntactic pattern recognition, algorithm complexity

Abstract:

We describe herein the results of implementing an algorithm for syntactic pattern recognition using the concept of Finite Inductive Sequences (FI). We discuss this idea, and then provide a big O estimate of the time to execute for the algorithms. We then provide some empirical data to support the analysis of the timing. This timing is critical if one wants to process millions of symbols from multiple sequences simultaneously. Lastly, we provide an example of the two FI algorithms applied to actual data taken from a gene and then describe some results as well as the associated data derived from this example.

Pattern matching is of obvious importance for bioinformatics and, in topic map terms, for recognizing subjects.

Questions:

  1. What “new problems continue to emerge” that you would use pattern matching to solve? (discussion)
  2. What about those problems makes them suitable for the application of pattern matching? (3-5 pages, no citations)
  3. What about those problems makes them suitable for the particular techniques described in this paper? (3-5 pages, no citations)

Indiana University – Bioinformatics

Filed under: Bioinformatics,Biomedical,Information Retrieval — Patrick Durusau @ 10:37 am

Indiana University – Bioinformatics

The Research & Projects Page offers a sampling of the work underway.

November 2, 2010

Healthcare Terminologies and Classification: Essential Keys to Interoperability

Filed under: Biomedical,Health care,Medical Informatics — Patrick Durusau @ 6:53 am

Healthcare Terminologies and Classification: Essential Keys to Interoperability published by the American Medical Informatics Association and the American Health Information Management Association is a bit dated (2007) but is still a good overview of the area.

Questions:

  1. What are the major initiatives on interoperability of healthcare terminologies today?
  2. What are the primary resources (web/print) for one of those initiatives?
  3. Prepare a one page abstract for each of five articles on one of these initiatives.

November 1, 2010

American Medical Informatics Association

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 4:31 pm

American Medical Informatics Association

From the website:

AMIA is dedicated to promoting the effective organization, analysis, management, and use of information in health care in support of patient care, public health, teaching, research, administration, and related policy. AMIA’s 4,000 members advance the use of health information and communications technology in clinical care and clinical research, personal health management, public health/population, and translational science with the ultimate objective of improving health.

For over thirty years the members of AMIA and its honorific college, the American College of Medical Informatics (ACMI), have sponsored meetings, education, policy and research programs. The federal government frequently calls upon AMIA as a source of informed, unbiased opinions on policy issues relating to the national health information infrastructure, uses and protection of personal health information, and public health considerations, among others.

Learning the terminology and concerns of an area is the first step towards successful development/application of topic maps.

Questions:

  1. Review the latest four issues of the Journal of the American Medical Informatics Association. (JAMIA)
  2. Select one article with issues that could be addressed by use of a topic map.
  3. How would you use a topic map to address those issues? (3-5 pages, no citations other than the article in question)
  4. Select one article with issues that would be difficult or cannot be addressed using a topic map.
  5. Why would a topic map be difficult to use or cannot address the issues in the article? (3-5 pages, no citations other than the article in question)

Medical Informatics – Formal Training

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 4:30 pm

Medical Informatics – Formal Training

A listing of formal training opportunities in medical informatics.

Understanding the current state of medical informatics is the starting point for offering topic map based services in health or medical areas.

October 31, 2010

UMLS-SKOS

Filed under: Bioinformatics,Biomedical,SKOS,UMLS — Patrick Durusau @ 7:25 pm

UMLS-SKOS

Abstract:

SKOS is a Semantic Web framework for representing thesauri, classification schemes, subject heading systems, controlled vocabularies, and taxonomies. It enables novel ways of representing terminological knowledge and its linkage with domain knowledge in unambiguous, reusable, and encapsulated fashion within computer applications. According to the National Library of Medicine, the UMLS Knowledge Source (UMLS-KS) integrates and distributes key terminology, classification and coding standards, and associated resources to promote creation of more effective and interoperable biomedical information systems and services “that behave as if they ‘understand’ the meaning of the language of biomedicine and health”. However the current information representation model utilized by UMLS-KS itself is not conducive to computer programs effectively retrieving and automatically and unambiguously interpreting the ‘meaning’ of the biomedical terms and concepts and their relationships.

In this presentation we propose using Simple Knowledge Organization System (SKOS) as an alternative to represent the body of knowledge incorporated within the UMLS-KS within the framework of the Semantic Web technologies. We also introduce our conceptualization of a transformation algorithm to produce an SKOS representation of the UMLS-KS that integrates UMLS-Semantic Network, the UMLS-Metathesaurus complete with all its source vocabularies as a unified body of knowledge along with appropriate information to trace or segregate information based on provenance and governance information. Our proposal and method is based on the idea that formal and explicit representation of any body of knowledge enables its unambiguous, and precise interpretation by automated computer programs. The consequences of such undertaking would be at least three fold: 1) ability to automatically check inconsistencies and errors within a large and complex body of knowledge, 2) automated information interpretation, integration, and discovery, and 3) better information sharing, repurposing and reusing (adoption), and extending the knowledgebase within a distributed and collaborative community of researchers. We submit that UMLS-KS is no exception to this and may benefit from all those advantages if represented fully using a formal representation language. Using SKOS in combination with the transformation algorithm introduced in this presentation are our first steps in that direction. We explain our conceptualization of the algorithms, problems we encountered and how we addressed them with a brief gap analysis to outline the road ahead of us. At the end we also present several use cases from our laboratories at the School of Health information Sciences utilizing this artifact.

WebEx Recording Presentation

Slides

The slides are good but you will need to watch the presentation to give them context.

My only caution concerns:

Our proposal and method is based on the idea that formal and explicit representation of any body of knowledge enables its unambiguous, and precise interpretation by automated computer programs.

I don’t doubt that our computers can return “unambiguous, and precise interpretation[s]” but that isn’t the same thing as “correct” interpretations.
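To make the SKOS framing concrete, here is a minimal sketch (not the presenters' transformation algorithm) of one concept carrying a preferred label and near-synonymous alternate labels, using rdflib. The namespace is invented and the CUI-style identifier is illustrative:

```python
# Sketch: one concept expressed in SKOS with rdflib. The real UMLS-KS
# transformation carries provenance, source vocabularies and the
# Semantic Network; none of that is attempted here.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import SKOS

UMLS = Namespace("http://example.org/umls/")   # invented namespace

g = Graph()
c = UMLS["C0027051"]                           # illustrative CUI-style id
g.add((c, SKOS.prefLabel, Literal("Myocardial Infarction", lang="en")))
g.add((c, SKOS.altLabel, Literal("Heart Attack", lang="en")))
g.add((c, SKOS.inScheme, UMLS["metathesaurus"]))

print(g.serialize(format="turtle"))
```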

October 28, 2010

Biostar

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 5:31 am

Biostar bills itself as “Questions and answers on bioinformatics, computational genomics and systems biology.”

Still building up, but assuming the community gets behind the site, it should be a “go to” place for the areas it covers.

I mention it here so that:

  • topic mappers can recommend it
  • topic mappers can learn bioinformatics nomenclature

October 27, 2010

Semi-Supervised Graph Embedding Scheme with Active Learning (SSGEAL): Classifying High Dimensional Biomedical Data

Filed under: Bioinformatics,Biomedical,Dimension Reduction — Patrick Durusau @ 4:15 pm

Semi-Supervised Graph Embedding Scheme with Active Learning (SSGEAL): Classifying High Dimensional Biomedical Data
Authors: George Lee, Anant Madabhushi

Abstract:

In this paper, we present a new dimensionality reduction (DR) method (SSGEAL) which integrates Graph Embedding (GE) with semi-supervised and active learning to provide a low dimensional data representation that allows for better class separation. Unsupervised DR methods such as Principal Component Analysis and GE have previously been applied to the classification of high dimensional biomedical datasets (e.g. DNA microarrays and digitized histopathology) in the reduced dimensional space. However, these methods do not incorporate class label information, often leading to embeddings with significant overlap between the data classes. Semi-supervised dimensionality reduction (SSDR) methods have recently been proposed which utilize both labeled and unlabeled instances for learning the optimal low dimensional embedding. However, in several problems involving biomedical data, obtaining class labels may be difficult and/or expensive. SSGEAL utilizes labels from instances, identified as “hard to classify” by a support vector machine based active learning algorithm, to drive an updated SSDR scheme while reducing labeling cost. Real world biomedical data from 7 gene expression studies and 3900 digitized images of prostate cancer needle biopsies were used to show the superior performance of SSGEAL compared to both GE and SSAGE (a recently popular SSDR method) in terms of both the Silhouette Index (SI) (SI = 0.35 for GE, SI = 0.31 for SSAGE, and SI = 0.50 for SSGEAL) and the Area Under the Receiver Operating Characteristic Curve (AUC) for a Random Forest classifier (AUC = 0.85 for GE, AUC = 0.93 for SSAGE, AUC = 0.94 for SSGEAL).
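A minimal sketch of the active-learning ingredient, assuming scikit-learn and synthetic data: an SVM's distance from the margin flags the “hard to classify” samples worth labeling. The graph-embedding half of SSGEAL is not reproduced here:

```python
# Sketch: margin-based uncertainty sampling, the "hard to classify"
# selection that drives SSGEAL's label requests.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
labeled = np.arange(20)                     # pretend only 20 labels exist

clf = SVC(kernel="linear").fit(X[labeled], y[labeled])
margin = np.abs(clf.decision_function(X))   # distance from the hyperplane
hard = np.argsort(margin)[:10]              # 10 most ambiguous samples
print("query an expert for labels on:", hard)
```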

Questions:

  1. Literature on information loss from dimension reduction?
  2. Active Learning assisting with topic maps authoring. Yes/no/maybe?
  3. Update the bibliography of one of the papers cited in this paper.

The UMLS Metathesaurus: representing different views of biomedical concepts

Filed under: Bioinformatics,Biomedical,TMDM,TMRM,Topic Maps,UMLS — Patrick Durusau @ 6:14 am

The UMLS Metathesaurus: representing different views of biomedical concepts

Abstract

The UMLS Metathesaurus is a compilation of names, relationships, and associated information from a variety of biomedical naming systems representing different views of biomedical practice or research. The Metathesaurus is organized by meaning, and the fundamental unit in the Metathesaurus is the concept. Differing names for a biomedical meaning are linked in a single Metathesaurus concept. Extensive additional information describing semantic characteristics, occurrence in machine-readable information sources, and how concepts co-occur in these sources is also provided, enabling a greater comprehension of the concept in its various contexts. The Metathesaurus is not a standardized vocabulary; it is a tool for maximizing the usefulness of existing vocabularies. It serves as a knowledge source for developers of biomedical information applications and as a powerful resource for biomedical information specialists.

Bull Med Libr Assoc. 1993 Apr;81(2):217-22.
Schuyler PL, Hole WT, Tuttle MS, Sherertz DD.
Medical Subject Headings Section, National Library of Medicine, Bethesda, MD 20894.

Questions:

  1. Did you notice the date on the citation?
  2. Map this article to the Topic Maps Data Model (3-5 pages, no citations)
  3. Where does the Topic Maps Data Model differ from this article? (3-5 pages, no citations)
  4. If concept = proxy, what concepts (subjects) don’t have proxies in the Metathesaurus?
  5. On what basis are “biomedical meanings” mapped to a single Metathesaurus “concept?” Describe in general but illustrate with at least five (5) examples.

October 26, 2010

The Neighborhood Auditing Tool – Update

Filed under: Bioinformatics,Biomedical,Interface Research/Design,Ontology,SNOMED,UMLS — Patrick Durusau @ 7:22 am

The Neighborhood Auditing Tool for the UMLS and its Source Terminologies is a presentation mentioned here several days ago.

If you missed it, go to: http://bioontology.org/neighborhood-audiiting-tool for the slides and WEBEX recording.

Pay close attention to:

The clear emphasis on getting user feedback during the design of the auditing interface.

The “neighborhood” concept he introduces has direct application to XML editing.

Find the “right” way to present parent/child/sibling controls to users and you would have a killer XML application.

Questions:

  1. Slides 8 – 9. Other than saying this is an error (true enough), on what basis is that judgment made?
  2. Slides 18 – 20. Read the references (slide 20) on neighborhoods. Pick another domain, what aspects of neighborhoods are relevant? (3-5 pages, with citations)
  3. Slides 21 – 22. How do your neighborhood graphs compare to those here?
  4. Slides 23 – 46. Short summary of the features of NAT and an evaluation (no citations). Or, use NAT as the basis for development of an interface for another domain. (project)
  5. Slides 49 – 55. Visualizations for use and checking. Compare to current literature on visualization of vocabularies/ontologies. (project)
  6. Slides 56 – 58. Snomed browsing. Report on current status. (3-5 pages, citations)
  7. Slides 57 – 73. Work on neighborhoods and extents. To what extent is a “small intersection type” a sub-graph and research on sub-graphs applicable? Any number of issues and questions can be gleaned from this section. (project)

October 25, 2010

Consensus of Ambiguity: Theory and Application of Active Learning for Biomedical Image Analysis

Consensus of Ambiguity: Theory and Application of Active Learning for Biomedical Image Analysis
Authors: Scott Doyle, Anant Madabhushi

Abstract:

Supervised classifiers require manually labeled training samples to classify unlabeled objects. Active Learning (AL) can be used to selectively label only “ambiguous” samples, ensuring that each labeled sample is maximally informative. This is invaluable in applications where manual labeling is expensive, as in medical images where annotation of specific pathologies or anatomical structures is usually only possible by an expert physician. Existing AL methods use a single definition of ambiguity, but there can be significant variation among individual methods. In this paper we present a consensus of ambiguity (CoA) approach to AL, where only samples which are consistently labeled as ambiguous across multiple AL schemes are selected for annotation. CoA-based AL uses fewer samples than Random Learning (RL) while exploiting the variance between individual AL schemes to efficiently label training sets for classifier training. We use a consensus ratio to determine the variance between AL methods, and the CoA approach is used to train classifiers for three different medical image datasets: 100 prostate histopathology images, 18 prostate DCE-MRI patient studies, and 9,000 breast histopathology regions of interest from 2 patients. We use a Probabilistic Boosting Tree (PBT) to classify each dataset as either cancer or non-cancer (prostate), or high or low grade cancer (breast). Training is done using CoA-based AL, and is evaluated in terms of accuracy and area under the receiver operating characteristic curve (AUC). CoA training yielded between 0.01-0.05% greater performance than RL for the same training set size; approximately 5-10 more samples were required for RL to match the performance of CoA, suggesting that CoA is a more efficient training strategy.

The consensus of ambiguity (CoA) approach is trivially extensible to other kinds of image analysis. Intelligence photos, anyone?

What intrigues me is extension of that approach to other types of data analysis.

Such as having multiple AL schemes process textual data and follow the CoA approach on what to bounce to experts for annotation.
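Here is a minimal sketch of that extension, assuming scikit-learn and synthetic data: two stand-in “AL schemes” (different classifiers' uncertainties) and an intersection that keeps only the samples both flag as ambiguous:

```python
# Sketch of consensus-of-ambiguity: only samples that every scheme
# flags as ambiguous are sent to the expert for annotation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=1)
labeled = np.arange(30)                       # small labeled pool

def ambiguous(clf, n=30):
    p = clf.fit(X[labeled], y[labeled]).predict_proba(X)[:, 1]
    return set(np.argsort(np.abs(p - 0.5))[:n])   # closest to 50/50

schemes = [LogisticRegression(max_iter=1000),
           RandomForestClassifier(random_state=0)]
consensus = set.intersection(*(ambiguous(c) for c in schemes))
print("send to expert:", sorted(consensus))
```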

Questions:

  1. What types of ambiguity would this approach miss?
  2. How would you apply this method to other data?
  3. How would you measure success/failure of application to other data?
  4. Design and apply this concept to specified data set. (project)

October 24, 2010

The Role of Sparse Data Representation in Semantic Image Understanding

Filed under: Bioinformatics,Biomedical,Image Understanding,Sparse Image Representation — Patrick Durusau @ 10:56 am

The Role of Sparse Data Representation in Semantic Image Understanding
Author: Artur Przelaskowski
Keywords: Computational intelligence, image understanding, sparse image representation, nonlinear approximation, semantic information theory

Abstract:

This paper discusses a concept of computational understanding of medical images in a context of computer-aided diagnosis. Fundamental research purpose was improved diagnosis of the cases, formulated by human experts. Designed methods of soft computing with extremely important role of: a) semantically sparse data representation, b) determined specific information, formally and experimentally, and c) computational intelligence approach were adjusted to the challenges of image-based diagnosis. Formalized description of image representation procedures was completed with exemplary results of chosen applications, used to explain formulated concepts, to make them more pragmatic and assure diagnostic usefulness. Target pathology was ontologically described, characterized by as stable as possible patterns, numerically described using semantic descriptors in sparse representation. Adjusting of possible source pathology to computational map of target pathology was fundamental issue of considered procedures. Computational understanding means: a) putting together extracted and numerically described content, b) recognition of diagnostic meaning of content objects and their common significance, and c) verification by comparative analysis with all accessible information and knowledge sources (patient record, medical lexicons, the newest communications, reference databases, etc.).

Interesting in its own right for image analysis in the important area of medical imaging, but it caught my eye for another reason.

Sparse data representation works for understanding images.

Would it work in other semantic domains?

Questions:

  1. What are the minimal clues that enable us to understand a particular text?
  2. Can we learn those clues before we encounter a particular text?
  3. Can we create clues for others to use when encountering a particular text?
  4. How would we identify the text for application of our clues?

Introduction to Biomedical Ontologies

Filed under: Biomedical,Ontology — Patrick Durusau @ 9:58 am

A very good introduction to ontologies: Introduction to Biomedical Ontologies.

This introduction neatly frames the issue addressed by both controlled vocabularies (ontologies) and topic maps.

When faced with multiple terms for a single subject, a controlled vocabulary (ontology) solves the problem by using a single term.

Other terms that mean the same subject are “near synonyms.”

Watch the video and then check back here for a post called: Near Synonyms.

I will discuss how the treatment of “near synonyms” differs between topic maps and controlled vocabularies (ontologies).
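As a preview, in data-structure terms (a sketch with invented terms): the controlled vocabulary keeps one winner, while a topic-map-style proxy keeps every name attached to a single subject:

```python
# Sketch of the contrast: controlled vocabulary vs. topic-map proxy.
# Terms are invented for illustration.
controlled_vocab = {                 # every near synonym maps to ONE term
    "myocardial infarction": "myocardial infarction",
    "heart attack": "myocardial infarction",
}

subject_proxy = {                    # one subject, ALL its names retained
    "id": "mi",
    "names": {"myocardial infarction", "heart attack"},
}

print(controlled_vocab["heart attack"])
print(subject_proxy["names"])
```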

October 23, 2010

SLiMSearch: A Webserver for Finding Novel Occurrences of Short Linear Motifs in Proteins, Incorporating Sequence Context

Filed under: Bioinformatics,Biomedical,Pattern Recognition,Subject Identity — Patrick Durusau @ 5:56 am

SLiMSearch: A Webserver for Finding Novel Occurrences of Short Linear Motifs in Proteins, Incorporating Sequence Context
Authors: Norman E. Davey, Niall J. Haslam, Denis C. Shields, Richard J. Edwards
Keywords: short linear motif, motif discovery, minimotif, elm

Short, linear motifs (SLiMs) play a critical role in many biological processes. The SLiMSearch (Short, Linear Motif Search) webserver is a flexible tool that enables researchers to identify novel occurrences of pre-defined SLiMs in sets of proteins. Numerous masking options give the user great control over the contextual information to be included in the analyses, including evolutionary filtering and protein structural disorder. User-friendly output and visualizations of motif context allow the user to quickly gain insight into the validity of a putatively functional motif occurrence. Users can search motifs against the human proteome, or submit their own datasets of UniProt proteins, in which case motif support within the dataset is statistically assessed for over- and under-representation, accounting for evolutionary relationships between input proteins. SLiMSearch is freely available as open source Python modules and all webserver results are available for download. The SLiMSearch server is available at: http://bioware.ucd.ie/slimsearch.html .

Software: http://bioware.ucd.ie/slimsearch.html
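At its core, a SLiM search is a constrained pattern scan. A toy sketch follows; the motif pattern and sequences are invented, and SLiMSearch adds the context masking and statistics that make results meaningful:

```python
# Toy SLiM scan: a short linear motif written as a regular expression,
# matched against protein sequences. No context masking, no statistics.
import re

motif = re.compile(r"P.{2}P..P")     # invented proline-spaced pattern
proteins = {
    "protA": "MKTPAAPLLPWD",
    "protB": "MSSGGSGGSGGS",
}
for name, seq in proteins.items():
    for m in motif.finditer(seq):
        print(name, m.start(), m.group())
```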

Seemed like an appropriate resource to follow on today’s earlier posting.

Note in the keywords, “elm.”

Care to guess what that means? If you are a bioinformatics or biology person you may get it correct.

What do you think the odds are that any person much less a general search engine will get it correct?

Topic maps are about making sure you find the Eukaryotic Linear Motif Resource without wading through what a search of any common search engine returns for “elm.”

Questions:

  1. What other terms in this paper represent other subjects?
  2. What properties would you use to identify those subjects?
  3. How would you communicate those subjects to someone else?

An Algorithm to Find All Identical Motifs in Multiple Biological Sequences

Filed under: Bioinformatics,Biomedical,Pattern Recognition,Subject Identity — Patrick Durusau @ 5:25 am

An Algorithm to Find All Identical Motifs in Multiple Biological Sequences
Authors: Ashish Kishor Bindal, R. Sabarinathan, J. Sridhar, D. Sherlin, K. Sekar
Keywords: Sequence motifs, nucleotide and protein sequences, identical motifs, dynamic programming, direct repeat and phylogenetic relationships

Sequence motifs are of greater biological importance in nucleotide and protein sequences. The conserved occurrence of identical motifs represents the functional significance and helps to classify the biological sequences. In this paper, a new algorithm is proposed to find all identical motifs in multiple nucleotide or protein sequences. The proposed algorithm uses the concept of dynamic programming. The application of this algorithm includes the identification of (a) conserved identical sequence motifs and (b) identical or direct repeat sequence motifs across multiple biological sequences (nucleotide or protein sequences). Further, the proposed algorithm facilitates the analysis of comparative internal sequence repeats for the evolutionary studies which helps to derive the phylogenetic relationships from the distribution of repeats.
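A minimal sketch of the problem (not the paper's dynamic-programming algorithm): find every length-k substring shared by all input sequences. Real motif finders handle maximality and variable lengths; this does not:

```python
# Sketch: identical k-mers present in ALL sequences.
def shared_kmers(seqs, k):
    kmer_sets = [{s[i:i + k] for i in range(len(s) - k + 1)} for s in seqs]
    return set.intersection(*kmer_sets)

seqs = ["ATGCGATTA", "GGATGCGAA", "TTATGCGAT"]
print(shared_kmers(seqs, 5))   # {'ATGCG', 'TGCGA'}
```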

Good illustration that subject identification, here sequence motifs in nucleotide and protein sequences, varies by domain.

Subject matching in this type of data on the basis of assigned URL identifiers for sequence motifs would be silly.

But that’s the question, isn’t it? What is the appropriate basis for subject matching in a particular domain?

Questions:

  1. Identify and describe one (1) domain where URL matching for subjects would be unnecessary overhead. (3 pages, no citations)
  2. Identify and describe one (1) domain where URL matching for subjects would be useful. (3 pages, no citations)
  3. What are the advantages of URLs as a lingua franca? (3 pages, no citations)
  4. What are the disadvantages of URLs as a lingua franca? (3 pages, no citations)

***
BTW, when you see “no citations” that does not mean you should not be reading the relevant literature. What it means is that I want your analysis of the issues and not your channeling of the latest literature.

October 22, 2010

National Center for Biomedical Ontology

Filed under: Biomedical,Health care,Ontology — Patrick Durusau @ 6:00 am

National Center for Biomedical Ontology

I feel like a kid in a candy store at this site.

I suppose it is being an academic researcher at heart.

Reports on specific resources to follow.

October 20, 2010

GPM: A Graph Pattern Matching Kernel with Diffusion for Chemical Compound Classification

GPM: A Graph Pattern Matching Kernel with Diffusion for Chemical Compound Classification
Authors: Aaron Smalter, Jun Huan and Gerald Lushington

Abstract:

Classifying chemical compounds is an active topic in drug design and other cheminformatics applications. Graphs are general tools for organizing information from heterogeneous sources and have been applied in modeling many kinds of biological data. With the fast accumulation of chemical structure data, building highly accurate predictive models for chemical graphs emerges as a new challenge.

In this paper, we demonstrate a novel technique called Graph Pattern Matching kernel (GPM). Our idea is to leverage existing frequent pattern discovery methods and explore their application to kernel classifiers (e.g. support vector machine) for graph classification. In our method, we first identify all frequent patterns from a graph database. We then map subgraphs to graphs in the database and use a diffusion process to label nodes in the graphs. Finally the kernel is computed using a set matching algorithm. We performed experiments on 16 chemical structure data sets and have compared our methods to other major graph kernels. The experimental results demonstrate excellent performance of our method.

The authors also note:

Publicly-available large-scale chemical compound databases have offered tremendous opportunities for creating highly efficient in silico drug design methods. Many machine learning and data mining algorithms have been applied to study the structure-activity relationship of chemicals with the goal of building classifiers for graph-structured data.

In other words, with a desktop machine, public data, and a little imagination, you can make a fundamental contribution to drug design methods. (FWIW, the pharmaceutical companies are making money hand over fist.)

Integrating your contribution or its results into existing information, such as with topic maps, will only increase its value.
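A drastically simplified cousin of a graph pattern kernel, for flavor only: each molecule is mapped to counts of tiny edge patterns and the kernel is a dot product of the counts. GPM's frequent-pattern mining and diffusion labeling are not attempted, and the molecule is invented:

```python
# Sketch: count node-label pairs on edges, take a dot product.
from collections import Counter
import networkx as nx

def edge_features(g):
    return Counter(tuple(sorted((g.nodes[u]["atom"], g.nodes[v]["atom"])))
                   for u, v in g.edges)

def kernel(g1, g2):
    f1, f2 = edge_features(g1), edge_features(g2)
    return sum(f1[p] * f2[p] for p in f1)

mol = nx.Graph()
mol.add_nodes_from([(0, {"atom": "C"}), (1, {"atom": "C"}),
                    (2, {"atom": "O"})])
mol.add_edges_from([(0, 1), (1, 2)])
print(kernel(mol, mol))   # 2
```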

Integrating Biological Data – Not A URL In Sight!

Actual title: Kernel methods for integrating biological data by Dick de Ridder, The Delft Bioinformatics Lab, Delft University of Technology.

Biological data integration to improve protein expression – read hugely profitable industrial processes based on biology.

Need to integrate biological data, including “prior knowledge.”

In case kernel methods aren’t your “thing,” one important point:

There are vast seas of economically important data unsullied by URLs.

Kernel methods are one method to integrate some of that data.
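The simplest form of that integration, as a sketch with random stand-in data: Gram matrices computed from different data sources over the same samples, combined as a weighted sum, which is again a valid kernel. The weights here are fixed; multiple kernel learning methods learn them:

```python
# Sketch: integrate two data sources by summing their kernels.
import numpy as np

rng = np.random.default_rng(0)

def linear_gram(features):
    return features @ features.T

K_expr = linear_gram(rng.random((50, 200)))   # e.g. expression profiles
K_seq = linear_gram(rng.random((50, 30)))     # e.g. sequence features

K_combined = 0.7 * K_expr + 0.3 * K_seq       # still a valid kernel
print(K_combined.shape)
```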

Questions:

  1. How to integrate kernel methods into topic maps? (research project)
  2. Subjects in a kernel method? (research paper, limit to one method)
  3. Modeling specific uses of kernels in topic maps. (research project)
  4. Edges of kernels? Are there subject limits to kernels? (research project)

September 27, 2010

Employing Publically Available Biological Expert Knowledge from Protein-Protein Interaction Information

Filed under: Bioinformatics,Biomedical,Subject Identity — Patrick Durusau @ 7:18 pm

Employing Publically Available Biological Expert Knowledge from Protein-Protein Interaction Information
Authors: Kristine A. Pattin, Jiang Gui, Jason H. Moore
Keywords: GWAS – SNPs – Protein-protein interaction – Epistasis

Abstract:

Genome wide association studies (GWAS) are now allowing researchers to probe the depths of common complex human diseases, yet few have identified single sequence variants that confer disease susceptibility. As hypothesized, this is due to the fact that multiple interacting factors influence clinical endpoint. Given that the number of single nucleotide polymorphism (SNP) combinations grows exponentially with the number of SNPs being analyzed, computational methods designed to detect these interactions in smaller datasets are thus not applicable. Providing statistical expert knowledge has exhibited an improvement in their performance, and we believe biological expert knowledge to be as capable. Since one of the strongest demonstrations of the functional relationship between genes is protein-protein interactions, we present a method that exploits this information in genetic analyses. This study provides a step towards utilizing expert knowledge derived from public biological sources to assist computational intelligence algorithms in the search for epistasis.

Applying human knowledge “…to assist computational intelligence algorithms…,” sounds like subject identity and topic maps to me!
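A hedged sketch of that assist, with invented gene assignments and PPI edges: only SNP pairs whose host genes interact are kept as candidates, pruning the combinatorial search:

```python
# Sketch: prune SNP pairs using a PPI network as expert knowledge.
from itertools import combinations
import networkx as nx

ppi = nx.Graph([("BRCA1", "TP53"), ("TP53", "MDM2")])   # invented edges
snp_gene = {"rs1": "BRCA1", "rs2": "TP53",
            "rs3": "MDM2", "rs4": "EGFR"}               # invented mapping

candidates = [(a, b) for a, b in combinations(snp_gene, 2)
              if ppi.has_edge(snp_gene[a], snp_gene[b])]
print(candidates)   # [('rs1', 'rs2'), ('rs2', 'rs3')]
```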

September 23, 2010

HUGO Gene Nomenclature Committee

Filed under: Bioinformatics,Biomedical,Data Mining,Entity Extraction,Indexing,Software — Patrick Durusau @ 8:32 am

HUGO Gene Nomenclature Committee, a committee assigning unique names to genes.

Become familiar with the HUGO site, then read: The success (or not) of HUGO nomenclature (Genome Biology, 2006).

Now read: Moara: a Java library for extracting and normalizing gene and protein mentions (BMC Bioinformatics 2010)

Q: How would you apply the techniques in the Moara article to build a topic map? Would you keep/discard normalization?

PS: Moara Project (software, etc.)
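A toy take on the normalization step such tools perform; the synonym table is invented, and Moara itself uses curated lexica and machine learning rather than a flat lookup:

```python
# Sketch: map surface gene mentions to a canonical HGNC-style symbol.
SYNONYMS = {
    "p53": "TP53",
    "tumor protein p53": "TP53",
    "her2": "ERBB2",
    "neu": "ERBB2",
}

def normalize(mention):
    return SYNONYMS.get(mention.strip().lower())

for m in ["P53", "HER2", "unknown-gene"]:
    print(m, "->", normalize(m))
```

Note what discarding normalization would mean here: you would keep all the surface mentions as names of one subject instead of collapsing them to a single symbol.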

September 19, 2010

Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data

Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data

Destined to be a deeply influential resource.

Read the paper, use the Chem2Bio2RDF application for a week, then answer these questions:

  1. Choose three (3) subjects that are identified in this framework.
  2. For each subject, how is it identified in this framework?
  3. For each subject, have you seen it in another framework or system?
  4. For each subject seen in another framework/system, how was it identified there?

Extra credit: What one thing would you change about any of the identifications in this system? Why?

September 15, 2010

1st ACM International Health Informatics Symposium – November 11-12, 2010

Filed under: Biomedical,Conferences,Health care — Patrick Durusau @ 5:48 am

1st ACM International Health Informatics Symposium – November 11-12, 2010.

Interesting presentations:

  • The Effect of Different Context Representations on Word Sense Discrimination in Biomedical Texts
  • An evaluation of feature sets and sampling techniques for de-identification of medical records
  • Federated Querying Architecture for Clinical & Translational Health IT
  • Contextualizing consumer health information searching: an analysis of questions in a social Q&A community

Will watch for the call for papers for next year. Would be nice to have a topic map paper or two on the program.

