Archive for the ‘Multidimensional’ Category

An Indexing Structure for Dynamic Multidimensional Data in Vector Space

Saturday, September 1st, 2012

An Indexing Structure for Dynamic Multidimensional Data in Vector Space by Elena Mikhaylova, Boris Novikov and Anton Volokhov. (Advances in Databases and Information Systems, Advances in Intelligent Systems and Computing, 2013, Volume 186, 185-193, DOI: 10.1007/978-3-642-32741-4_17)

Abstract:

The multidimensional k-NN (k nearest neighbors) query problem is relevant to a large variety of database applications, including information retrieval, natural language processing, and data mining. To solve it efficiently, the database needs an indexing structure that provides this kind of search. However, attempts to find an exact solution are hardly feasible in multidimensional space. In this paper, a novel indexing technique for the approximate solution of k-NN problem is described and analyzed. The construction of the indexing tree is based on clustering. Indexing structure is implemented on top of high-performance industrial DBMS.
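For readers who want a feel for what "approximate k-NN over a clustering-based index" can mean, here is a minimal sketch. It is not the authors' algorithm: it assumes a single flat level of k-means clusters rather than a tree, and the names build_index, approx_knn and the probes parameter are my own inventions for illustration.

```python
import math
import random

def assign(points, centroids):
    """Put every point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def build_index(points, n_clusters, iterations=10, seed=0):
    """Hypothetical flat index: a few rounds of k-means over tuples of floats."""
    rng = random.Random(seed)
    centroids = rng.sample(points, n_clusters)
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # Move each centroid to the mean of its members (keep it if the cluster is empty).
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids, assign(points, centroids)

def approx_knn(query, index, k, probes=2):
    """Search only the `probes` clusters whose centroids are closest to the query."""
    centroids, clusters = index
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [p for i in order[:probes] for p in clusters[i]]
    return sorted(candidates, key=lambda p: math.dist(query, p))[:k]

def exact_knn(query, points, k):
    """Brute-force baseline: scan every point."""
    return sorted(points, key=lambda p: math.dist(query, p))[:k]
```

The answer is approximate because a true neighbor may sit in a cluster that was not probed. Raising probes trades speed for accuracy, and probing every cluster degenerates into the exact scan.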

The review of recent work is helpful, but when the paper reaches the algorithm for indexing “…dynamic multidimensional data…,” it slips away from me.

Where is the dynamic nature of the data that is being overcome by the indexing?

I ask because we, as human observers, are untroubled by the curse of dimensionality, even when data is dynamically changing.

Those are, however, two important aspects when data is processed by machine:

  • The number of dimensions of data, and
  • The rate at which the data is changing.
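To make the first of those aspects concrete, here is a small sketch of why machines struggle as dimensions grow. It assumes uniformly random points in a unit cube, which real data rarely is, and relative_contrast is a name I made up. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import math
import random

def relative_contrast(dim, n_points=500, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as the dimension grows, so "nearest" means less and less.
for dim in (2, 20, 200):
    print(dim, round(relative_contrast(dim), 2))
```

When the contrast is small, nearly every point is about as far away as every other, which is one reason exact nearest-neighbor indexes lose their advantage over a plain scan.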

Optimal simultaneous superpositioning of multiple structures with missing data

Friday, July 20th, 2012

Optimal simultaneous superpositioning of multiple structures with missing data by Douglas L. Theobald and Phillip A. Steindel. (Bioinformatics, 2012, 28: 1972-1979)

Abstract:

Motivation: Superpositioning is an essential technique in structural biology that facilitates the comparison and analysis of conformational differences among topologically similar structures. Performing a superposition requires a one-to-one correspondence, or alignment, of the point sets in the different structures. However, in practice, some points are usually ‘missing’ from several structures, for example, when the alignment contains gaps. Current superposition methods deal with missing data simply by superpositioning a subset of points that are shared among all the structures. This practice is inefficient, as it ignores important data, and it fails to satisfy the common least-squares criterion. In the extreme, disregarding missing positions prohibits the calculation of a superposition altogether.

Results: Here, we present a general solution for determining an optimal superposition when some of the data are missing. We use the expectation–maximization algorithm, a classic statistical technique for dealing with incomplete data, to find both maximum-likelihood solutions and the optimal least-squares solution as a special case.

Availability and implementation: The methods presented here are implemented in THESEUS 2.0, a program for superpositioning macromolecular structures. ANSI C source code and selected compiled binaries for various computing platforms are freely available under the GNU open source license from http://www.theseus3d.org.

Contact: dtheobald@brandeis.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

From the introduction:

How should we properly compare and contrast the 3D conformations of similar structures? This fundamental problem in structural biology is commonly addressed by performing a superposition, which removes arbitrary differences in translation and rotation so that a set of structures is oriented in a common reference frame (Flower, 1999). For instance, the conventional solution to the superpositioning problem uses the least-squares optimality criterion, which orients the structures in space so as to minimize the sum of the squared distances between all corresponding points in the different structures. Superpositioning problems, also known as Procrustes problems, arise frequently in many scientific fields, including anthropology, archaeology, astronomy, computer vision, economics, evolutionary biology, geology, image analysis, medicine, morphometrics, paleontology, psychology and molecular biology (Dryden and Mardia, 1998; Gower and Dijksterhuis, 2004; Lele and Richtsmeier, 2001). A particular case we consider here is the superpositioning of multiple 3D macromolecular coordinate sets, where the points to be superpositioned correspond to atoms. Although our analysis specifically concerns the conformations of macromolecules, the methods developed herein are generally applicable to any entity that can be represented as a set of Cartesian points in a multidimensional space, whether the particular structures under study are proteins, skulls, MRI scans or geological strata.

We draw an important distinction here between a structural ‘alignment’ and a ‘superposition.’ An alignment is a discrete mapping between the residues of two or more structures. One of the most common ways to represent an alignment is using the familiar row and column matrix format of sequence alignments using the single letter abbreviations for residues (Fig. 1). An alignment may be based on sequence information or on structural information (or on both). A superposition, on the other hand, is a particular orientation of structures in 3D space. [emphasis added]
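The least-squares criterion described in the quoted introduction can be sketched for the simplest possible case. This is not the THESEUS method: it assumes only two structures, 2D points, and a complete one-to-one alignment (pairing by list position, no missing data), where the optimal rotation angle has a closed form. The function names are mine.

```python
import math

def centroid(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def superpose(mobile, target):
    """Rotate and translate `mobile` onto `target`, minimizing the sum of
    squared distances between points paired by index (the alignment)."""
    cm, ct = centroid(mobile), centroid(target)
    m = [(x - cm[0], y - cm[1]) for x, y in mobile]   # remove translation
    t = [(x - ct[0], y - ct[1]) for x, y in target]
    # In 2D the least-squares rotation angle is atan2(cross sum, dot sum).
    cross = sum(mx * ty - my * tx for (mx, my), (tx, ty) in zip(m, t))
    dot = sum(mx * tx + my * ty for (mx, my), (tx, ty) in zip(m, t))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + ct[0], s * x + c * y + ct[1]) for x, y in m]

def rmsd(a, b):
    """Root-mean-square deviation between paired points."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))

# Demo: move a shape rigidly, then recover the original orientation.
shape = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0)]
angle = 0.7
moved = [(math.cos(angle) * x - math.sin(angle) * y + 5.0,
          math.sin(angle) * x + math.cos(angle) * y - 2.0) for x, y in shape]
fitted = superpose(moved, shape)
print(rmsd(moved, shape), rmsd(fitted, shape))
```

Note how the alignment (which point pairs with which) is an input here, while the superposition (the rotation and translation) is the output. That is the distinction the authors draw.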

I have deep reservations about representing semantics with Cartesian metrics, but in fact that happens quite frequently. And, allegedly, usefully.

Leaving my doubts to one side, this superpositioning technique could prove to be a useful exploration technique.

If you experiment with this technique, a report of your experiences would be appreciated.

Globalsdb

Saturday, April 9th, 2011

Globalsdb

Jack Park forwarded this to my attention.

I am puzzling over:

At its core, the Globals database is powered by an extremely efficient multidimensional data engine. The exposed interface support access to the multidimensional structures – providing the highest performance and greatest range of storage possibilities. A multitude of applications can be implemented entirely using this data engine directly.

There is no data dictionary, and thus no data definitions, for the multidimensional data engine.
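To picture what a multidimensional engine without a data dictionary might look like, here is a toy sketch. It is emphatically not the Globals API: SparseArray and its children method are hypothetical names, and the storage is just a Python dict keyed by tuples of subscripts.

```python
class SparseArray:
    """Hypothetical schema-less store: values addressed by tuples of subscripts."""

    def __init__(self):
        self._nodes = {}

    @staticmethod
    def _key(subscripts):
        # store["a"] passes a bare value; store["a", 1] passes a tuple.
        return subscripts if isinstance(subscripts, tuple) else (subscripts,)

    def __setitem__(self, subscripts, value):
        self._nodes[self._key(subscripts)] = value

    def __getitem__(self, subscripts):
        return self._nodes[self._key(subscripts)]

    def children(self, *prefix):
        """Distinct next-level subscripts under `prefix`, sorted.
        Assumes subscripts at any one level are mutually comparable."""
        n = len(prefix)
        return sorted({k[n] for k in self._nodes if len(k) > n and k[:n] == prefix})

store = SparseArray()
store["person", 1, "name"] = "Ada"
store["person", 1, "langs"] = 3
store["person", 2, "name"] = "Grace"
print(store.children("person"), store.children("person", 1))
```

Nothing in the store says what "person" or "name" means, or that "langs" holds a count. Any such definition lives in the application code that reads and writes the subscripts.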

I “get” the part about the extremely efficient multidimensional data engine (they say it often enough) but am curious why there is no data dictionary. Or at least, why is that a claim to put up front?

Granted, I don’t consider data dictionaries to be self-describing, but then neither are multidimensional arrays. Not necessarily, anyway.

This database apparently lies at the core of a commercial application, or line of commercial applications, by InterSystems Corporation.