Archive for the ‘Alignment’ Category

Ontology Matching

Tuesday, December 24th, 2013

Ontology Matching: Proceedings of the 8th International Workshop on Ontology Matching, co-located with the 12th International Semantic Web Conference (ISWC 2013), edited by Pavel Shvaiko, et al.

Technical papers:

The Ontology Alignment Evaluation Initiative 2013 results are represented by seventeen (17) papers.

In addition, there are eleven (11) posters.

Complete proceedings in one PDF file.

Ontologies are a popular answer to semantic diversity.

You might say the more semantic diversity in a field, the greater the number of ontologies it has. 😉

A natural consequence of the proliferation of ontologies is the need to match or map between them.

As you know, I prefer to capture the reasons for mappings, to avoid repeating the exercise over and over, but that’s not a universal requirement.

If you have an hourly contract for mapping between ontologies, you may not want to lessen the burden of such mapping, year in and year out.

And for some purposes, mechanical mappings may be sufficient.

This work is a good update on the current state of the art for matching ontologies.

Manual Alignment of Anatomy Ontologies

Saturday, January 12th, 2013

Matching arthropod anatomy ontologies to the Hymenoptera Anatomy Ontology: results from a manual alignment by Matthew A. Bertone, István Mikó, Matthew J. Yoder, Katja C. Seltmann, James P. Balhoff, and Andrew R. Deans. (Database (2013) 2013 : bas057 doi: 10.1093/database/bas057)

Abstract:

Matching is an important step for increasing interoperability between heterogeneous ontologies. Here, we present alignments we produced as domain experts, using a manual mapping process, between the Hymenoptera Anatomy Ontology and other existing arthropod anatomy ontologies (representing spiders, ticks, mosquitoes and Drosophila melanogaster). The resulting alignments contain from 43 to 368 mappings (correspondences), all derived from domain-expert input. Despite the many pairwise correspondences, only 11 correspondences were found in common between all ontologies, suggesting either major intrinsic differences between each ontology or gaps in representing each group’s anatomy. Furthermore, we compare our findings with putative correspondences from Bioportal (derived from LOOM software) and summarize the results in a total evidence alignment. We briefly discuss characteristics of the ontologies and issues with the matching process.

Database URL: http://purl.obolibrary.org/obo/hao/2012-07-18/arthropod-mappings.obo.

A great example of the difficulty of matching across ontologies, particularly when the granularity or subjects of ontologies vary.
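
The "in common between all ontologies" figure from the abstract can be pictured as a set intersection over the pairwise alignments. What follows is only a minimal sketch: the term labels, identifiers and the `common_terms` helper are invented for illustration and are not taken from the published mapping file.

```python
# Hypothetical data: each alignment maps a Hymenoptera Anatomy Ontology (HAO)
# term label to a term in one other arthropod ontology. All labels and
# identifiers below are made up for illustration.
alignments = {
    "spider": {"leg": "spider:leg", "eye": "spider:eye", "abdomen": "spider:abdomen"},
    "tick": {"leg": "tick:leg", "eye": "tick:eye"},
    "mosquito": {"leg": "mosquito:leg", "eye": "mosquito:eye", "wing": "mosquito:wing"},
    "drosophila": {"leg": "fly:leg", "eye": "fly:eye", "wing": "fly:wing"},
}


def common_terms(alignments):
    """Return the HAO terms that have a correspondence in every alignment."""
    term_sets = [set(mapping) for mapping in alignments.values()]
    return set.intersection(*term_sets) if term_sets else set()


shared = common_terms(alignments)
print(sorted(shared))  # → ['eye', 'leg']
```

Each pairwise alignment here holds two or three mappings, yet only two terms survive the intersection, which is the same pattern the authors report at a larger scale.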

Optimal simultaneous superpositioning of multiple structures with missing data

Friday, July 20th, 2012

Optimal simultaneous superpositioning of multiple structures with missing data by Douglas L. Theobald and Phillip A. Steindel. (Bioinformatics (2012) 28: 1972-1979)

Abstract:

Motivation: Superpositioning is an essential technique in structural biology that facilitates the comparison and analysis of conformational differences among topologically similar structures. Performing a superposition requires a one-to-one correspondence, or alignment, of the point sets in the different structures. However, in practice, some points are usually ‘missing’ from several structures, for example, when the alignment contains gaps. Current superposition methods deal with missing data simply by superpositioning a subset of points that are shared among all the structures. This practice is inefficient, as it ignores important data, and it fails to satisfy the common least-squares criterion. In the extreme, disregarding missing positions prohibits the calculation of a superposition altogether.

Results: Here, we present a general solution for determining an optimal superposition when some of the data are missing. We use the expectation–maximization algorithm, a classic statistical technique for dealing with incomplete data, to find both maximum-likelihood solutions and the optimal least-squares solution as a special case.

Availability and implementation: The methods presented here are implemented in THESEUS 2.0, a program for superpositioning macromolecular structures. ANSI C source code and selected compiled binaries for various computing platforms are freely available under the GNU open source license from http://www.theseus3d.org.

Contact: dtheobald@brandeis.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

From the introduction:

How should we properly compare and contrast the 3D conformations of similar structures? This fundamental problem in structural biology is commonly addressed by performing a superposition, which removes arbitrary differences in translation and rotation so that a set of structures is oriented in a common reference frame (Flower, 1999). For instance, the conventional solution to the superpositioning problem uses the least-squares optimality criterion, which orients the structures in space so as to minimize the sum of the squared distances between all corresponding points in the different structures. Superpositioning problems, also known as Procrustes problems, arise frequently in many scientific fields, including anthropology, archaeology, astronomy, computer vision, economics, evolutionary biology, geology, image analysis, medicine, morphometrics, paleontology, psychology and molecular biology (Dryden and Mardia, 1998; Gower and Dijksterhuis, 2004; Lele and Richtsmeier, 2001). A particular case we consider here is the superpositioning of multiple 3D macromolecular coordinate sets, where the points to be superpositioned correspond to atoms. Although our analysis specifically concerns the conformations of macromolecules, the methods developed herein are generally applicable to any entity that can be represented as a set of Cartesian points in a multidimensional space, whether the particular structures under study are proteins, skulls, MRI scans or geological strata.

We draw an important distinction here between a structural ‘alignment’ and a ‘superposition.’ An alignment is a discrete mapping between the residues of two or more structures. One of the most common ways to represent an alignment is using the familiar row and column matrix format of sequence alignments using the single letter abbreviations for residues (Fig. 1). An alignment may be based on sequence information or on structural information (or on both). A superposition, on the other hand, is a particular orientation of structures in 3D space. [emphasis added]
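
To make the alignment versus superposition distinction concrete, here is a minimal sketch of the conventional approach the abstract describes: keep only the points shared by both structures, then find the least-squares rotation and translation. It is a 2D, two-structure simplification and is not the expectation-maximization method implemented in THESEUS. The function names, the `None`-for-gap convention and the sample coordinates are all assumptions made for illustration.

```python
import math


def shared_pairs(alignment, coords_a, coords_b):
    """Alignment step: keep only columns where both structures have a point.

    `alignment` is a list of (index_in_a, index_in_b); None marks a gap.
    """
    return [(coords_a[i], coords_b[j]) for i, j in alignment
            if i is not None and j is not None]


def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def superpose_2d(pairs):
    """Superposition step: least-squares rigid fit of the b points onto the a points.

    Returns (theta, transform), where transform maps a point of b into a's frame.
    """
    ca = centroid([a for a, _ in pairs])
    cb = centroid([b for _, b in pairs])
    dot = cross = 0.0
    for (ax, ay), (bx, by) in pairs:
        # Center both points on their centroids before comparing them.
        uax, uay = ax - ca[0], ay - ca[1]
        ubx, uby = bx - cb[0], by - cb[1]
        dot += ubx * uax + uby * uay
        cross += ubx * uay - uby * uax
    # The angle that maximizes the summed dot product minimizes squared distance.
    theta = math.atan2(cross, dot)
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    def transform(p):
        x, y = p[0] - cb[0], p[1] - cb[1]
        return (cos_t * x - sin_t * y + ca[0], sin_t * x + cos_t * y + ca[1])

    return theta, transform


# Structure B is structure A rotated by 90 degrees and shifted by (5, 5),
# with A's second point missing (a gap in the alignment).
coords_a = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
coords_b = [(5.0, 5.0), (4.0, 7.0), (4.0, 5.0)]
alignment = [(0, 0), (1, None), (2, 1), (3, 2)]

pairs = shared_pairs(alignment, coords_a, coords_b)
theta, transform = superpose_2d(pairs)
fitted_b = [transform(p) for p in coords_b]
```

The gapped column is simply dropped before fitting, which is exactly the practice the paper criticizes as discarding data.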

I have deep reservations about the representations of semantics using Cartesian metrics but in fact that happens quite frequently. And allegedly, usefully.

Leaving my doubts to one side, this superpositioning technique could prove to be a useful exploration technique.

If you experiment with this technique, a report of your experiences would be appreciated.

…a phylogeny-aware graph algorithm

Monday, June 25th, 2012

Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm by Loytynoja, A., Vilella, A. J., Goldman, N.

From the post:

Motivation: Accurate alignment of large numbers of sequences is demanding and the computational burden is further increased by downstream analyses depending on these alignments. With the abundance of sequence data, an integrative approach of adding new sequences to existing alignments without their full re-computation and maintaining the relative matching of existing sequences is an attractive option. Another current challenge is the extension of reference alignments with fragmented sequences, as those coming from next-generation metagenomics, that contain relatively little information. Widely used methods for alignment extension are based on profile representation of reference sequences. These do not incorporate and use phylogenetic information and are affected by the composition of the reference alignment and the phylogenetic positions of query sequences.

Results: We have developed a method for phylogeny-aware alignment of partial-order sequence graphs and apply it here to the extension of alignments with new data. Our new method, called PAGAN, infers ancestral sequences for the reference alignment and adds new sequences in their phylogenetic context, either to predefined positions or by finding the best placement for sequences of unknown origin. Unlike profile-based alternatives, PAGAN considers the phylogenetic relatedness of the sequences and is not affected by inclusion of more diverged sequences in the reference set. Our analyses show that PAGAN outperforms alternative methods for alignment extension and provides superior accuracy for both DNA and protein data, the improvement being especially large for fragmented sequences. Moreover, PAGAN-generated alignments of noisy next-generation sequencing (NGS) sequences are accurate enough for the use of RNA-seq data in evolutionary analyses.

Availability: PAGAN is written in C++, licensed under the GPL and its source code is available at http://code.google.com/p/pagan-msa.

Contact: ari.loytynoja@helsinki.fi

Does your graph software support “…phylogeny-aware alignment of partial-order sequence graphs…”?

Amalgame

Saturday, January 14th, 2012

Amalgame

Amalgame is the software that Tim Wray mentions in his post, vocabulary alignment, meaning and understanding in the world museum, as using a technique called “interactive alignment.”

From the homepage:

Amalgame (AMsterdam ALignment GenerAtion MEtatool) is a tool for finding, evaluating and managing vocabulary alignments. We explicitly do not aim to produce ‘yet another alignment method’ but rather seek to combine existing matching techniques and methods such as those developed within the context of the Ontology Alignment Evaluation Initiative (OAEI), in which different alignment methods can be combined using a workflow setup. The Amalgame Alignment server will feature:

  • A workflow composition functionality, where various alignment generators can be positioned. Their resulting mapping sets can be used as input for filtering methods, other alignment generators or combined into overlap sets.
  • A statistics function, where statistics for alignment sets will be shown
  • An evaluation tool, where subsets of alignments can be evaluated manually
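
As a rough picture of what such a workflow involves, the sketch below chains two toy alignment generators, a filter, an overlap set and some statistics. It does not use Amalgame's actual interface: every function name, the prefix heuristic and the sample vocabularies are hypothetical.

```python
from collections import Counter


def exact_label_match(source, target):
    """Alignment generator: pair concepts whose labels match after lowercasing."""
    index = {label.lower(): tid for tid, label in target.items()}
    return {(sid, index[label.lower()])
            for sid, label in source.items() if label.lower() in index}


def prefix_match(source, target, length=4):
    """Alignment generator: pair concepts whose labels share a leading prefix."""
    return {(sid, tid)
            for sid, s_label in source.items()
            for tid, t_label in target.items()
            if s_label.lower()[:length] == t_label.lower()[:length]}


def unambiguous(mappings):
    """Filter: keep mappings whose source concept has exactly one target."""
    counts = Counter(sid for sid, _ in mappings)
    return {(sid, tid) for sid, tid in mappings if counts[sid] == 1}


# Two made-up vocabularies: concept id -> label.
source = {"a1": "Painting", "a2": "Paintbrush", "a3": "Sculpture"}
target = {"b1": "painting", "b2": "Painter", "b3": "sculptures"}

exact = exact_label_match(source, target)
loose = prefix_match(source, target)
overlap = exact & loose          # mappings both generators agree on
filtered = unambiguous(loose)    # generator output fed into a filter

stats = {
    "exact": len(exact),
    "prefix": len(loose),
    "overlap": len(overlap),
    "unambiguous": len(filtered),
}
print(stats)  # → {'exact': 1, 'prefix': 5, 'overlap': 1, 'unambiguous': 1}
```

Mappings are kept as plain sets of (source, target) pairs, so combining generator outputs into overlap sets is ordinary set arithmetic, and the small sets that survive filtering are the natural candidates for manual evaluation.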

Vocabulary and metadata workflow

The Amalgame toolkit realises the second step of a workflow specified by the Europeana Connect project for SKOSifying vocabularies and converting collection metadata into the EDM (Europeana Data Model). The first step, conversion of XML data into RDF, is supported by the XMLRDF tool.

Amalgame paper at TPDL 2011

We’re happy to announce our paper about Amalgame was accepted as a full paper for the International Conference on Theory and Practice of Digital Libraries 2011 (TPDL 2011).

The extensive online appendix also contains a rich use case description.

I think you will want to grab the paper, which has the following abstract:

In many heritage institutes, objects are routinely described using terms from predefined vocabularies. When object collections need to be merged or linked, the question arises how those vocabularies relate. In practice it is often unclear for data providers how well alignment tools will perform on their specific vocabularies. This creates a bottleneck to align vocabularies, as data providers want to have tight control over the quality of their data. We will discuss the key limitations of current tools in more detail and propose an alternative approach. We will show how this approach has been used in two alignment use cases, and demonstrate how it is currently supported by our Amalgame alignment platform.

I am downloading/installing the software.

I am curious whether a similar approach, albeit without converting data into RDF, could be used to create alignments of unstructured vocabularies, along with reasons for the mappings between them.

My reasoning, in part, is that there are far more non-structured vocabularies where access could be enhanced by mappings to other vocabularies, along with reasons for those mappings.