HyperDex: A Distributed, Searchable Key-Value Store for Cloud Computing by Robert Escrivay, Bernard Wongz and Emin Güun Sirery.
Abstract:
Distributed key-value stores are now a standard component of high-performance web services and cloud computing applications. While key-value stores offer significant performance and scalability advantages compared to traditional databases, they achieve these properties through a restricted API that limits object retrieval—an object can only be retrieved by the (primary and only) key under which it was inserted. This paper presents HyperDex, a novel distributed key-value store that provides a unique search primitive that enables queries on secondary attributes. The key insight behind HyperDex is the concept of hyperspace hashing in which objects with multiple attributes are mapped into a multidimensional hyperspace. This mapping leads to efficient implementations not only for retrieval by primary key, but also for partially-specified secondary attribute searches and range queries. A novel chaining protocol enables the system to provide strong consistency guarantees while supporting replication. An evaluation of the full system shows that HyperDex is orders of magnitude faster than Cassandra and MongoDB for finding partially specified objects. Additionally, HyperDex achieves high performance for simple get/put operations compared to current state-of-the-art key-value stores, with stronger fault tolerance and comparable scalability properties.
This paper merited a separate posting from the software.
Among many interesting points was the following one from the introduction:
A naive Euclidean space construction, however, can suffer from the “curse of dimensionality,” as the space exhibits an exponential increase in volume with each additional secondary attribute [8]. For objects with many attributes, the resulting Euclidean space would be large, and consequently, sparse. Nodes would then be responsible for large regions in the hyperspace, which would increase the number of nodes whose regions intersect search hyperplanes and thus limit the effectiveness of the basic approach. HyperDex addresses this problem by introducing an efficient and lightweight mechanism that partitions the data into smaller, limited-size sub-spaces, where each subspace covers a subset of object attributes in a lower dimensional hyperspace. Thus, by folding the hyperspace back into a lower number of dimensions, HyperDex can ensure higher node selectivity during searches.
Something keeps nagging at me about the use of the term Euclidean space. Since a Euclidean space is a metric space, I “get” how they can partition metric data into smaller sub-spaces.
Names don’t exist in metric spaces but sort orders and frequencies are known well enough to approximate such a solution. Or are they? I assume for more common languages that is the case but that is likely a poor assumption on my part.
What of other non-metric space values? On what basis would they be partitioned?